Re: [HACKERS] [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-08 Thread Andres Freund
On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
 While playing around/evaluating tsearch I notices that to_tsvector is
 obscenely slow for some files. After some profiling I found that this is
  due using a seperate TSParser in p_ishost/p_isURLPath in wparser_def.c. If
  a multibyte encoding is in use TParserInit copies the whole remaining
  input and converts it to wchar_t or pg_wchar - for every email or protocol
  prefixed url in the the document. Which obviously is bad.
 
 I solved the issue by having a seperate TParserCopyInit/TParserCopyClose
  which reuses the the already converted strings of the original TParser -
  only at different offsets.
 
 Another approach would be to get rid of the separate parser invocations -
 requiring a bunch of additional states. This seemed more complex to me, so
  I wanted to get some feedback first.
 
 Without patch:
 andres=# SELECT to_tsvector('english', document) FROM document WHERE
  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
 
  ──
 ─── ...
  (1 row)
 
 Time: 5835.676 ms
 
 With patch:
 andres=# SELECT to_tsvector('english', document) FROM document WHERE
  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
 
  ──
 ─── ...
  (1 row)
 
 Time: 395.341 ms
 
 Ill cleanup the patch if it seems like a sensible solution...
As nobody commented here is a corrected (stupid thinko) and cleaned up 
version. Anyone cares to comment whether I am the only one thinking this is an 
issue?

Andres
From cbdeb0bb636f3b7619d0a3019854809ea5565dac Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 8 Nov 2009 16:30:42 +0100
Subject: [PATCH] Fix TSearch inefficiency because of repeated copying of strings

---
 src/backend/tsearch/wparser_def.c |   63 ++--
 1 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/src/backend/tsearch/wparser_def.c b/src/backend/tsearch/wparser_def.c
index 301c1eb..7bbd826 100644
--- a/src/backend/tsearch/wparser_def.c
+++ b/src/backend/tsearch/wparser_def.c
@@ -328,6 +328,41 @@ TParserInit(char *str, int len)
 	return prs;
 }
 
+/*
+ * As an alternative to a full TParserInit one can create a
+ * TParserCopy which basically is a normally TParser without a private
+ * copy of the string - instead it uses the one from another TParser.
+ * This is usefull because at some places TParsers are created
+ * recursively and the repeated copying around of the strings can
+ * cause major inefficiency.
+ * Obviously one may not close the original TParser before the copy.
+ */
+static TParser *
+TParserCopyInit(const TParser const* orig)
+{
+	TParser*prs = (TParser *) palloc0(sizeof(TParser));
+
+	prs-charmaxlen = orig-charmaxlen;
+	prs-usewide = orig-usewide;
+	prs-lenstr = orig-lenstr - orig-state-posbyte;
+
+	prs-str = orig-str + orig-state-posbyte;
+	if(orig-pgwstr)
+		prs-pgwstr = orig-pgwstr + orig-state-poschar;
+	if(orig-wstr)
+		prs-wstr = orig-wstr + orig-state-poschar;
+
+	prs-state = newTParserPosition(NULL);
+	prs-state-state = TPS_Base;
+
+#ifdef WPARSER_TRACE
+	fprintf(stderr, parsing copy \%.*s\\n, len, str);
+#endif
+
+	return prs;
+}
+
+
 static void
 TParserClose(TParser *prs)
 {
@@ -350,6 +385,26 @@ TParserClose(TParser *prs)
 }
 
 /*
+ * See TParserCopyInit
+ */
+static void
+TParserCopyClose(TParser *prs)
+{
+	while (prs-state)
+	{
+		TParserPosition *ptr = prs-state-prev;
+
+		pfree(prs-state);
+		prs-state = ptr;
+	}
+#ifdef WPARSER_TRACE
+	fprintf(stderr, closing parser copy);
+#endif
+	pfree(prs);
+}
+
+
+/*
  * Character-type support functions, equivalent to is* macros, but
  * working with any possible encodings and locales. Notes:
  *  - with multibyte encoding and C-locale isw* function may fail
@@ -617,7 +672,7 @@ p_isignore(TParser *prs)
 static int
 p_ishost(TParser *prs)
 {
-	TParser*tmpprs = TParserInit(prs-str + prs-state-posbyte, prs-lenstr - prs-state-posbyte);
+	TParser *tmpprs = TParserCopyInit(prs);
 	int			res = 0;
 
 	tmpprs-wanthost = true;
@@ -631,7 +686,7 @@ p_ishost(TParser *prs)
 		prs-state-charlen = tmpprs-state-charlen;
 		res = 1;
 	}
-	TParserClose(tmpprs);
+	TParserCopyClose(tmpprs);
 
 	return res;
 }
@@ -639,7 +694,7 @@ p_ishost(TParser *prs)
 static int
 p_isURLPath(TParser *prs)
 {
-	TParser*tmpprs = TParserInit(prs-str + prs-state-posbyte, prs-lenstr - prs-state-posbyte);
+	TParser *tmpprs = TParserCopyInit(prs);
 	int			res = 0;
 
 	tmpprs-state = newTParserPosition(tmpprs-state);
@@ -654,7 +709,7 @@ p_isURLPath(TParser *prs)
 		prs-state-charlen = tmpprs-state-charlen;
 		res = 1;
 	}
-	TParserClose(tmpprs);
+	TParserCopyClose(tmpprs);
 
 	return res;
 }
-- 
1.6.5.12.gd65df24


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-08 Thread Kenneth Marshall
On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
 On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
  While playing around/evaluating tsearch I notices that to_tsvector is
  obscenely slow for some files. After some profiling I found that this is
   due using a seperate TSParser in p_ishost/p_isURLPath in wparser_def.c. If
   a multibyte encoding is in use TParserInit copies the whole remaining
   input and converts it to wchar_t or pg_wchar - for every email or protocol
   prefixed url in the the document. Which obviously is bad.
  
  I solved the issue by having a seperate TParserCopyInit/TParserCopyClose
   which reuses the the already converted strings of the original TParser -
   only at different offsets.
  
  Another approach would be to get rid of the separate parser invocations -
  requiring a bunch of additional states. This seemed more complex to me, so
   I wanted to get some feedback first.
  
  Without patch:
  andres=# SELECT to_tsvector('english', document) FROM document WHERE
   filename = '/usr/share/doc/libdrm-nouveau1/changelog';
  
   
  ??
  ?
   ...
   (1 row)
  
  Time: 5835.676 ms
  
  With patch:
  andres=# SELECT to_tsvector('english', document) FROM document WHERE
   filename = '/usr/share/doc/libdrm-nouveau1/changelog';
  
   
  ??
  ?
   ...
   (1 row)
  
  Time: 395.341 ms
  
  Ill cleanup the patch if it seems like a sensible solution...
 As nobody commented here is a corrected (stupid thinko) and cleaned up 
 version. Anyone cares to comment whether I am the only one thinking this is 
 an 
 issue?
 
 Andres

+1

As a user of tsearch, I can certainly appreciate to speed-up in parsing --
more CPU for everyone else.

Regards,
Ken

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why do OLD and NEW have special internal names?

2009-11-08 Thread Josh Berkus
Tom,

 BTW, this brings up another point, which is that up to now it's often
 been possible to use plpgsql variable names that conflict with
 core-parser reserved words, so long as you didn't need to use the
 reserved word with its special meaning.  That will stop working when
 this patch goes in.  Doesn't bother me any, but if anyone thinks it's
 a serious problem, speak now.

Sounds like a potential *big* blocker to upgrading; anyone with several
thousand lines of plpgsql can't really afford to refactor away all of
the accidental uses of reserved words.

That being said, reusing reserved words in this way was always wonky, so
I'm not sure how many people will have done so.  Best way is to commit
it to alpha3, and try to get people to test.

--Josh Berkus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Typed tables

2009-11-08 Thread Simon Riggs
On Thu, 2009-11-05 at 19:24 +0200, Peter Eisentraut wrote:

 This is useful in conjunction with PL/Proxy and similar RPC-type
 setups. On the frontend/proxy instances you only create the type, and
 the backend instances you create the storage for the type, and the
 database system would give you a little support keeping them in sync.
 Think interface and implementation.

Not sure I see why this is good. Why is issuing CREATE TYPE so much
easier than using CREATE TABLE? Is it worth the extra syntax and code to
support it? Can we do anything additional as a result of this? Is this
required by the standard or are we going past the standard?

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] operator exclusion constraints

2009-11-08 Thread Simon Riggs
On Sun, 2009-11-08 at 13:41 -0800, Jeff Davis wrote:
 On Sat, 2009-11-07 at 10:56 -0800, Jeff Davis wrote:
  EXCLUDE probably flows most nicely with the optional USING clause or
  without. My only complaint was that it's a transitive verb, so it seems
  to impart more meaning than it actually can. I doubt anyone would
  actually be more confused in practice, though. If a couple of people
  agree, I'll change it to EXCLUDE.
 
 It looks like EXCLUDE is the winner. Updated patch attached.
 
 The feature is still called operator exclusion constraints, and the
 docs still make reference to that name, but the syntax specification has
 been updated.

Don't think that name is very useful either... sounds like you want to
exclude operators, which is why I got lost in the first place. I'd call
them generic exclusion constraints or user-defined exclusion
constraints. Sorry for this.

-- 
 Simon Riggs   www.2ndQuadrant.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

2009-11-08 Thread Andres Freund
On Sunday 08 November 2009 17:41:15 Kenneth Marshall wrote:
 On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
  As nobody commented here is a corrected (stupid thinko) and cleaned up
  version. Anyone cares to comment whether I am the only one thinking this
  is an issue?
  Andres
 +1
 As a user of tsearch, I can certainly appreciate to speed-up in parsing --
 more CPU for everyone else.
Please note that this is mostly an issue when using rather long documents 
including either email addresses (ie. an @) or links with protocol prefixes 
(like ?+://) - so it might not give you personally a benefit :-(

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Specific names for plpgsql variable-resolution control options?

2009-11-08 Thread Alvaro Herrera
Tom Lane wrote:
 David E. Wheeler da...@kineticode.com writes:
  That works.
 
 plpgsql_variable_conflict = fatal | variable-first | column-first
 
 If we do that, presumably the per-function syntax would be
 
   #variable_conflict variable_first
 
 and so on, which is clear enough but might be thought a bit verbose
 for something people might be pasting into hundreds of functions.

If there's some way to paste it automatically (like, say, an
appropriate UPDATE incantation on pg_proc) then that doesn't seem like
an important problem.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Specific names for plpgsql variable-resolution control options?

2009-11-08 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 If there's some way to paste it automatically (like, say, an
 appropriate UPDATE incantation on pg_proc) then that doesn't seem like
 an important problem.

True, you could do UPDATE pg_proc SET prosrc = 'foo' || prosrc
WHERE something-appropriate.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Specific names for plpgsql variable-resolution control options?

2009-11-08 Thread Sam Mason
On Fri, Nov 06, 2009 at 07:09:46PM -0800, David E. Wheeler wrote:
 On Nov 6, 2009, at 6:57 PM, Tom Lane wrote:
 If we do that, presumably the per-function syntax would be
  #variable_conflict variable_first
 and so on, which is clear enough but might be thought a bit verbose
 for something people might be pasting into hundreds of functions.
 
 I suspect that most folks will set the GUC and few will actually use  
 it in functions.

Just to be clear about the semantics; what should happen if the user
doesn't specify a value for the function?  Should PG remember the GUC
value at creation time, or pull it in at invocation time?  I'd lean
towards fixing it at function creation time as it'd be one more caveat
for security definer functions otherwise.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] next CommitFest

2009-11-08 Thread Robert Haas
The next CommitFest is scheduled to start in a week.  So far, it
doesn't look too bad, though a lot could change between now and then.

I would personally prefer not to be involved in the management of the
next CommitFest.  Having done all of the July CommitFest and a good
chunk of the September CommitFest, I am feeling a bit burned out.

One pretty major fly in the ointment is that neither Hot Standby nor
Streaming Replication has been committed or shows much sign of being
about to be committed.  I think this is bad.  These are big features
that figure to have some bugs and break some things.  If they're not
committed in time for alpha3, then there won't be any significant
testing of these prior to alpha4/beta1, at the earliest.  I think
that's likely to lead to either (1) a very long beta period followed
by a late release or (2) a buggy release.  I feel like Simon Riggs and
Fujii Masao really pulled out all the stops to get these ready in time
for the September CommitFest, and while I'm not in a hurry to break
the world, I think the sooner these can hit the tree, the better of
we'll be in terms of releasing 8.5.

Just my $0.02,

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] next CommitFest

2009-11-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 I would personally prefer not to be involved in the management of the
 next CommitFest.  Having done all of the July CommitFest and a good
 chunk of the September CommitFest, I am feeling a bit burned out.

You did yeoman work on both --- thanks for that!

Do we have another volunteer to do this for the November fest?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Typed tables

2009-11-08 Thread Tatsuo Ishii
  This is useful in conjunction with PL/Proxy and similar RPC-type
  setups. On the frontend/proxy instances you only create the type, and
  the backend instances you create the storage for the type, and the
  database system would give you a little support keeping them in sync.
  Think interface and implementation.
 
 Not sure I see why this is good. Why is issuing CREATE TYPE so much
 easier than using CREATE TABLE? Is it worth the extra syntax and code to
 support it? Can we do anything additional as a result of this? Is this
 required by the standard or are we going past the standard?

+1. I'd like to hear from Peter why this is neccessary in the first
place.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] operator exclusion constraints

2009-11-08 Thread Jeff Davis
On Sun, 2009-11-08 at 22:03 +, Simon Riggs wrote:
 Don't think that name is very useful either... sounds like you want to
 exclude operators, which is why I got lost in the first place. I'd call
 them generic exclusion constraints or user-defined exclusion
 constraints. Sorry for this.

Either of those names are fine with me, too. The current name is a
somewhat shortened version of the name operator-based exclusion
constraints, so we can also just use that name. Or, just exclusion
constraints.

I'll leave the current patch as-is for now, and wait for some reviewer
comments. This is purely a documentation issue, so there are bound to be
a few of these things that I can clarify at once.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers