I wrote:
> "Donald Fraser" <postg...@kiwi-fraser.net> writes:
>> Using the default tsearch configuration, for 'english', text is being
>> wrongly parsed into the tsvector type.

> ts_debug shows that it's being parsed like this:

>       alias      |           description           |                 token                  |  dictionaries  |  dictionary  |                 lexemes
> -----------------+---------------------------------+----------------------------------------+----------------+--------------+------------------------------------------
>  tag             | XML tag                         | <span lang="EN-GB">                    | {}             |              |
>  protocol        | Protocol head                   | http://                                | {}             |              |
>  url             | URL                             | www.harewoodsolutions.co.uk/press.aspx | {simple}       | simple       | {www.harewoodsolutions.co.uk/press.aspx}
>  host            | Host                            | www.harewoodsolutions.co.uk            | {simple}       | simple       | {www.harewoodsolutions.co.uk}
>  url_path        | URL path                        | /press.aspx</span><span                | {simple}       | simple       | {/press.aspx</span><span}
>  blank           | Space symbols                   |                                        | {}             |              |
>  asciiword       | Word, all ASCII                 | lang                                   | {english_stem} | english_stem | {lang}
> ... etc ...

> ie the critical point seems to be that url_path is willing to soak up a
> string containing "<" and ">", so the span tags don't get recognized as
> separate lexemes.  While that's "obviously" the wrong thing in this
> particular example, I'm not sure if it's the wrong thing in general.
> Can anyone comment on the frequency of usage of those two symbols in
> URLs?

> In any case it's weird that the URL lexeme doesn't span the same text
> as the url_path one, but I'm not sure which one we should consider
> wrong.

I poked at this a bit.  The reason for the inconsistency between the url
and url_path lexemes is that the InURLPathStart state transitions
directly to InURLPath, which is *not* consistent with what happens while
parsing the URL as a whole: p_isURLPath() starts the sub-parser in
InFileFirst state.  The attached proposed patch rectifies that by
transitioning to InFileFirst state instead.  A possible objection to
this fix is that you may get either a "file" or a "url_path" component
lexeme, where before you always got "url_path".  I'm not sure if that's
something to worry about or not; I'd tend to think there's nothing much
wrong with it.
The other change in the attached patch is to make InURLPath parsing
stop at "<" or ">", as per discussion.

With these changes I get

regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |    description    |                 token                  | dictionaries | dictionary |                 lexemes
----------+-------------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head     | http://                                | {}           |            |
 url      | URL               | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     | {www.harewoodsolutions.co.uk/press.aspx}
 host     | Host              | www.harewoodsolutions.co.uk            | {simple}     | simple     | {www.harewoodsolutions.co.uk}
 file     | File or path name | /press.aspx                            | {simple}     | simple     | {/press.aspx}
 tag      | XML tag           | </span>                                | {}           |            |
(5 rows)

as compared to the prior behavior

regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |  description  |                 token                  | dictionaries | dictionary |                 lexemes
----------+---------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head | http://                                | {}           |            |
 url      | URL           | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     | {www.harewoodsolutions.co.uk/press.aspx}
 host     | Host          | www.harewoodsolutions.co.uk            | {simple}     | simple     | {www.harewoodsolutions.co.uk}
 url_path | URL path      | /press.aspx</span>                     | {simple}     | simple     | {/press.aspx</span>}
(4 rows)

Neither change affects the current set of regression tests; but none the
less there's a potential compatibility issue here, so my thought is to
apply this only in HEAD.  Comments?

            regards, tom lane

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.29
diff -c -r1.29 wparser_def.c
*** src/backend/tsearch/wparser_def.c    26 Apr 2010 17:10:18 -0000    1.29
--- src/backend/tsearch/wparser_def.c    26 Apr 2010 19:17:48 -0000
***************
*** 1504,1521 ****
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
      {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };
  
  static const TParserStateActionItem actionTPS_InURLPathStart[] = {
!     {NULL, 0, A_NEXT, TPS_InURLPath, 0, NULL}
  };
  
  static const TParserStateActionItem actionTPS_InURLPath[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };
--- 1504,1526 ----
      {p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
      {p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
+     {p_iseqC, '<', A_POP, TPS_Null, 0, NULL},
+     {p_iseqC, '>', A_POP, TPS_Null, 0, NULL},
      {p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_POP, TPS_Null, 0, NULL},
  };
  
  static const TParserStateActionItem actionTPS_InURLPathStart[] = {
!     /* this should transition to same state that p_isURLPath starts in */
!     {NULL, 0, A_NEXT, TPS_InFileFirst, 0, NULL}
  };
  
  static const TParserStateActionItem actionTPS_InURLPath[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '<', A_BINGO, TPS_Base, URLPATH, NULL},
+     {p_iseqC, '>', A_BINGO, TPS_Base, URLPATH, NULL},
      {p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };
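
For anyone who would rather not trace the state tables, the stopping rule
that the patched InURLPath rows express can be sketched as a plain loop.
This is only an illustration, not code from wparser_def.c: the names
ends_url_path and url_path_len are made up for the example, and using
isspace() in place of the parser's p_isnotspace test is an assumption.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not in wparser_def.c): true if c ends a URL path
 * under the patched rules, i.e. end of input, a quote character, '<',
 * '>', or whitespace.  isspace() stands in for the real p_isnotspace
 * check, which is an assumption of this sketch.
 */
static int
ends_url_path(unsigned char c)
{
    return c == '\0' || c == '"' || c == '\'' ||
           c == '<' || c == '>' || isspace(c);
}

/*
 * Hypothetical helper: length of the leading run of s that would be
 * kept as the path token before one of the terminators above is seen.
 */
static size_t
url_path_len(const char *s)
{
    size_t n = 0;

    while (!ends_url_path((unsigned char) s[n]))
        n++;
    return n;
}
```

With this rule, url_path_len("/press.aspx</span><span") is 11, so only
"/press.aspx" is taken and the "<" is left for the next token. Dropping
the two angle-bracket tests gives the old behavior, where the scan runs
on to the next whitespace or quote.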