On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
> On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> > While playing around with/evaluating tsearch I noticed that to_tsvector is
> > obscenely slow for some files. After some profiling I found that this is
> > due to using a separate TParser in p_ishost/p_isURLPath in wparser_def.c.
> > If a multibyte encoding is in use, TParserInit copies the whole remaining
> > input and converts it to wchar_t or pg_wchar - for every email or
> > protocol-prefixed URL in the document. Which obviously is bad.
> >
> > I solved the issue by having a separate TParserCopyInit/TParserCopyClose
> > which reuses the already converted strings of the original TParser -
> > only at different offsets.
> >
> > Another approach would be to get rid of the separate parser invocations -
> > requiring a bunch of additional states. This seemed more complex to me, so
> > I wanted to get some feedback first.
> >
> > Without patch:
> > andres=# SELECT to_tsvector('english', document) FROM document WHERE
> > filename = '/usr/share/doc/libdrm-nouveau1/changelog';
> >
> > ...
> > (1 row)
> >
> > Time: 5835.676 ms
> >
> > With patch:
> > andres=# SELECT to_tsvector('english', document) FROM document WHERE
> > filename = '/usr/share/doc/libdrm-nouveau1/changelog';
> >
> > ...
> > (1 row)
> >
> > Time: 395.341 ms
> >
> > I'll clean up the patch if it seems like a sensible solution...
> As nobody commented, here is a corrected (stupid thinko) and cleaned up
> version. Does anyone care to comment on whether I am the only one thinking
> this is an issue?
>
> Andres
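
For readers following along, the buffer-sharing idea described in the quoted
message can be sketched roughly as below. This is not the actual patch and not
the real TParser code: SketchParser, sketch_init, sketch_copy_init,
sketch_close and the chars_converted counter are all hypothetical names made
up for illustration, and the sketch assumes ASCII-only input so that byte and
character offsets coincide (the real parser has to track both).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for a parser state; NOT the real TParser struct. */
typedef struct SketchParser
{
    const char *str;        /* original input, borrowed from the caller */
    wchar_t    *wstr;       /* converted input */
    size_t      len;        /* characters available from this parser's start */
    int         owns_wstr;  /* nonzero only for the parser that allocated wstr */
} SketchParser;

/* Counts converted characters so the saved work can be observed. */
size_t chars_converted = 0;

/* Full init: allocates and converts the whole input (the expensive path). */
int
sketch_init(SketchParser *p, const char *str, size_t len)
{
    size_t i;

    p->wstr = malloc((len + 1) * sizeof(wchar_t));
    if (p->wstr == NULL)
        return -1;

    /* Assumption: ASCII-only input, so one byte maps to one wide char. */
    for (i = 0; i < len; i++)
    {
        p->wstr[i] = (wchar_t) (unsigned char) str[i];
        chars_converted++;
    }
    p->wstr[len] = L'\0';

    p->str = str;
    p->len = len;
    p->owns_wstr = 1;
    return 0;
}

/* Copy init: reuses the already converted buffer at a different offset. */
void
sketch_copy_init(SketchParser *sub, const SketchParser *orig, size_t offset)
{
    assert(offset <= orig->len);

    sub->str = orig->str + offset;
    sub->wstr = orig->wstr + offset;    /* shared, no new conversion */
    sub->len = orig->len - offset;
    sub->owns_wstr = 0;
}

/* Close: only the owning parser frees the converted buffer. */
void
sketch_close(SketchParser *p)
{
    if (p->owns_wstr)
        free(p->wstr);
    p->wstr = NULL;
    p->len = 0;
}
```

With a layout like this, each sub-parser started for a host or URL path check
costs a few pointer assignments instead of a copy and conversion of the whole
remaining input, which would be consistent with the kind of speed-up reported
above. The sub-parser must be closed before the parser that owns the buffer.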

+1

As a user of tsearch, I can certainly appreciate the speed-up in parsing --
more CPU for everyone else.

Regards,
Ken