Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

Kevin Grittner Thu, 10 Dec 2009 10:10:45 -0800

I wrote:
 
> Thanks for the sample which shows the difference.
 
Ah, once I got on the right track, there is no problem seeing
dramatic improvements with the patch.  It changes some nasty O(N^2)
cases to O(N).  In particular, the fixes affect parsing of large
strings encoded with multi-byte character encodings and containing
email addresses or URLs with a non-IP-address host component.  It
strikes me as odd that URLs without a slash following the host
portion, or with an IP address, are treated so differently in the
parser, but if we want to address that, it's a matter for another
patch.
 
I'm inclined to think that the minimal differences found in some of
my tests probably have more to do with happenstance of code
alignment than the particulars of the patch.
 
I did find one significant (although easily solved) problem.  In the
patch, the recursive setup of usewide, pgwstr, and wstr are not
conditioned by #ifdef USE_WIDE_UPPER_LOWER in the non-patched
version.  Unless there's a good reason for that, the #ifdef should
be added.
 
Less critical, but worth fixing one way or the other, TParserClose
does not drop breadcrumbs conditioned on #ifdef WPARSER_TRACE, but
TParserCopyClose does.  I think this should be consistent.
 
Finally, there's that spelling error in the comment for
TParserCopyInit.  Please fix.
 
If a patch is produced with fixes for these three things, I'd say
it'll be ready for committer.  I'm marking it as Waiting on Author
for fixes to these three items.
 
Sorry for the delay in review.  I hope there's still time to get
this committed in this CF.
 
-Kevin


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] tsearch parser inefficiency if text includes urls or emails - new version

Reply via email to