I wrote:
> "Donald Fraser" <[email protected]> writes:
>> Using the default tsearch configuration, for 'english', text is being
>> wrongly parsed into the tsvector type.
> ts_debug shows that it's being parsed like this:
>       alias      |           description           |                 token                  |  dictionaries  |  dictionary  |                 lexemes
> -----------------+---------------------------------+----------------------------------------+----------------+--------------+------------------------------------------
>  tag             | XML tag                         | <span lang="EN-GB">                    | {}             |              |
>  protocol        | Protocol head                   | http://                                | {}             |              |
>  url             | URL                             | www.harewoodsolutions.co.uk/press.aspx | {simple}       | simple       | {www.harewoodsolutions.co.uk/press.aspx}
>  host            | Host                            | www.harewoodsolutions.co.uk            | {simple}       | simple       | {www.harewoodsolutions.co.uk}
>  url_path        | URL path                        | /press.aspx</span><span                | {simple}       | simple       | {/press.aspx</span><span}
>  blank           | Space symbols                   |                                        | {}             |              |
>  asciiword       | Word, all ASCII                 | lang                                   | {english_stem} | english_stem | {lang}
> ... etc ...
> i.e., the critical point seems to be that url_path is willing to soak up a
> string containing "<" and ">", so the span tags don't get recognized as
> separate lexemes. While that's "obviously" the wrong thing in this
> particular example, I'm not sure if it's the wrong thing in general.
> Can anyone comment on the frequency of usage of those two symbols in
> URLs?
> In any case it's weird that the URL lexeme doesn't span the same text
> as the url_path one, but I'm not sure which one we should consider
> wrong.
I poked at this a bit. The reason for the inconsistency between the url
and url_path lexemes is that the InURLPathStart state transitions
directly to InURLPath, which is *not* consistent with what happens while
parsing the URL as a whole: p_isURLPath() starts the sub-parser in
InFileFirst state. The attached proposed patch rectifies that by
transitioning to InFileFirst state instead. A possible objection to
this fix is that you may get either a "file" or a "url_path" component
lexeme, where before you always got "url_path". I'm not sure if that's
something to worry about or not; I'd tend to think there's nothing much
wrong with it.
The other change in the attached patch is to make InURLPath parsing
stop at "<" or ">", as per discussion.
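To illustrate the second change in isolation, here is a minimal standalone sketch. It is not the actual wparser_def.c code: the Rule struct, the is_* helpers, the two rule tables, and url_path_len() are hypothetical names made up for this example. It only mimics the idea of an ordered rule table in which the first matching test decides whether to consume the character or end the token.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one parser state's action table. */
typedef int (*CharTest) (int c, int arg);
typedef enum { ACT_NEXT, ACT_BINGO } Action;

typedef struct
{
	CharTest	test;		/* NULL matches any character */
	int			arg;		/* character to compare against, if any */
	Action		action;		/* consume the char, or end the token */
} Rule;

static int is_eof(int c, int arg) { (void) arg; return c == '\0'; }
static int is_eq(int c, int arg) { return c == arg; }
static int is_notspace(int c, int arg) { (void) arg; return !isspace(c); }

/* Mimics the old behavior: only EOF, quotes, and whitespace end the path. */
static const Rule old_rules[] = {
	{is_eof, 0, ACT_BINGO},
	{is_eq, '"', ACT_BINGO},
	{is_eq, '\'', ACT_BINGO},
	{is_notspace, 0, ACT_NEXT},
	{NULL, 0, ACT_BINGO}
};

/* Mimics the proposed behavior: '<' and '>' also end the path. */
static const Rule new_rules[] = {
	{is_eof, 0, ACT_BINGO},
	{is_eq, '"', ACT_BINGO},
	{is_eq, '\'', ACT_BINGO},
	{is_eq, '<', ACT_BINGO},
	{is_eq, '>', ACT_BINGO},
	{is_notspace, 0, ACT_NEXT},
	{NULL, 0, ACT_BINGO}
};

/* Length of the leading "URL path" token of s under the given rules. */
static size_t
url_path_len(const Rule *rules, const char *s)
{
	size_t		n = 0;

	for (;;)
	{
		const Rule *r = rules;

		/* The first matching rule wins; the NULL-test rule is the catch-all. */
		while (r->test != NULL && !r->test((unsigned char) s[n], r->arg))
			r++;
		if (r->action == ACT_BINGO)
			return n;
		n++;
	}
}
```

In this sketch, url_path_len(old_rules, "/press.aspx</span>") covers the whole string, while url_path_len(new_rules, ...) stops right before the "<", which is the effect the patch is after for InURLPath.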
With these changes I get
regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |    description    |                 token                  | dictionaries | dictionary |                 lexemes
----------+-------------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head     | http://                                | {}           |            |
 url      | URL               | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     | {www.harewoodsolutions.co.uk/press.aspx}
 host     | Host              | www.harewoodsolutions.co.uk            | {simple}     | simple     | {www.harewoodsolutions.co.uk}
 file     | File or path name | /press.aspx                            | {simple}     | simple     | {/press.aspx}
 tag      | XML tag           | </span>                                | {}           |            |
(5 rows)
as compared to the prior behavior
regression=# SELECT * from ts_debug('http://www.harewoodsolutions.co.uk/press.aspx</span>');
  alias   |  description  |                 token                  | dictionaries | dictionary |                 lexemes
----------+---------------+----------------------------------------+--------------+------------+------------------------------------------
 protocol | Protocol head | http://                                | {}           |            |
 url      | URL           | www.harewoodsolutions.co.uk/press.aspx | {simple}     | simple     | {www.harewoodsolutions.co.uk/press.aspx}
 host     | Host          | www.harewoodsolutions.co.uk            | {simple}     | simple     | {www.harewoodsolutions.co.uk}
 url_path | URL path      | /press.aspx</span>                     | {simple}     | simple     | {/press.aspx</span>}
(4 rows)
Neither change affects the current set of regression tests; nonetheless,
there's a potential compatibility issue here, so my thought is to
apply this only in HEAD.
Comments?
regards, tom lane
Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.29
diff -c -r1.29 wparser_def.c
*** src/backend/tsearch/wparser_def.c 26 Apr 2010 17:10:18 -0000 1.29
--- src/backend/tsearch/wparser_def.c 26 Apr 2010 19:17:48 -0000
***************
*** 1504,1521 ****
  	{p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
  	{p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
  	{p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
  	{p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
  	{NULL, 0, A_POP, TPS_Null, 0, NULL},
  };
  
  static const TParserStateActionItem actionTPS_InURLPathStart[] = {
! 	{NULL, 0, A_NEXT, TPS_InURLPath, 0, NULL}
  };
  
  static const TParserStateActionItem actionTPS_InURLPath[] = {
  	{p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
  	{p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
  	{p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
  	{p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
  	{NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };
--- 1504,1526 ----
  	{p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
  	{p_iseqC, '"', A_POP, TPS_Null, 0, NULL},
  	{p_iseqC, '\'', A_POP, TPS_Null, 0, NULL},
+ 	{p_iseqC, '<', A_POP, TPS_Null, 0, NULL},
+ 	{p_iseqC, '>', A_POP, TPS_Null, 0, NULL},
  	{p_isnotspace, 0, A_CLEAR, TPS_InURLPath, 0, NULL},
  	{NULL, 0, A_POP, TPS_Null, 0, NULL},
  };
  
  static const TParserStateActionItem actionTPS_InURLPathStart[] = {
! 	/* this should transition to same state that p_isURLPath starts in */
! 	{NULL, 0, A_NEXT, TPS_InFileFirst, 0, NULL}
  };
  
  static const TParserStateActionItem actionTPS_InURLPath[] = {
  	{p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL},
  	{p_iseqC, '"', A_BINGO, TPS_Base, URLPATH, NULL},
  	{p_iseqC, '\'', A_BINGO, TPS_Base, URLPATH, NULL},
+ 	{p_iseqC, '<', A_BINGO, TPS_Base, URLPATH, NULL},
+ 	{p_iseqC, '>', A_BINGO, TPS_Base, URLPATH, NULL},
  	{p_isnotspace, 0, A_NEXT, TPS_InURLPath, 0, NULL},
  	{NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL}
  };
--
Sent via pgsql-bugs mailing list ([email protected])