Re: [HACKERS] fulltext parser strange behave

Andrew Dunstan Thu, 08 Nov 2007 12:13:16 -0800


Andrew Dunstan wrote:

Tom Lane wrote:
Andrew Dunstan <[EMAIL PROTECTED]> writes:
Tom Lane wrote:
Well, the state machine definitely thinks that tag names should contain only ASCII letters (with possibly a leading or trailing '/'). Given the HTML examples I suppose we should allow non-first digits too. Is there
anything else that should be considered a tag?  What about dash and
underscore for instance?
The docs say we specifically accept HTML tags. Are we really just accepting anything that is a string of ASCII letters as the tag name? Then we should adjust the docs. <foo> and <foo1234> are not HTML tags.
I don't think I want to try to maintain a list of exactly which
identifiers are considered valid tag names ... and if I did, I wouldn't
put it into the parser.  It would be a dictionary's job to tell valid
from invalid tag names, no?
I don't have a quarrel with that. But then we should be more clear about what we are recognizing. We could describe the thing as an HTML-like tag, possibly. I think the same probably goes for entities too.

I've just been looking at the state machine in wparser_def.c. I think the processing for entities is also a few bob short in the pound. It recognises decimal numeric character references, but not hexadecimal numeric character references. That's fairly silly since the HTML spec specifically says the latter are "particularly useful". The rules for named entities are also deficient w.r.t. digits, just like the case of tags that Tom noticed. This isn't academic: HTML features a number of named entities with digits in the name (sup2, frac14 for example).
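
To make the three kinds of reference concrete, here is a minimal Python sketch. It is not the wparser_def.c state machine: the pattern names and classify_reference() are hypothetical, and the named-entity pattern is only an assumed rule (a letter followed by letters or digits) that happens to admit sup2 and frac14.

```python
import html
import re

# Hypothetical patterns, for illustration only; not taken from wparser_def.c.
DECIMAL_REF = re.compile(r"&#[0-9]+;")
HEX_REF = re.compile(r"&#[xX][0-9A-Fa-f]+;")
# Assumed rule for named entities: a letter, then letters or digits.
NAMED_REF = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")


def classify_reference(text):
    """Return the kind of character reference `text` is, or None."""
    if DECIMAL_REF.fullmatch(text):
        return "decimal"
    if HEX_REF.fullmatch(text):
        return "hexadecimal"
    if NAMED_REF.fullmatch(text):
        return "named"
    return None


# All three spellings denote the same character, SUPERSCRIPT TWO (U+00B2).
for ref in ("&#178;", "&#xB2;", "&sup2;"):
    print(ref, classify_reference(ref), html.unescape(ref) == "\u00b2")
```

A parser that accepts only the first of these three forms will miss the other two spellings of the same character.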

In XML at least, legal names are defined by the following rules from the spec:

[4]       NameStartChar       ::=       ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]       NameChar       ::=       NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

[5]       Name       ::=       NameStartChar (NameChar)*

Restricting this to ASCII, we get:

[4]       NameStartChar       ::=       ":" | [A-Z] | "_" | [a-z]
[4a]       NameChar       ::=       NameStartChar | "-" | "." | [0-9]
[5]       Name       ::=       NameStartChar (NameChar)*

or this regex for Name:

[A-Za-z:_][A-Za-z0-9:_.-]*

I suggest we use that or something very close to it as the rule for names in these patterns.
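
As a quick check of what that rule would and would not accept, here is a small Python sketch. It uses the regex above verbatim; the helper name is_name() and the sample strings are just illustrations, and this is not how the C state machine would be written.

```python
import re

# The ASCII-only Name rule from this message; fullmatch() anchors both ends.
NAME = re.compile(r"[A-Za-z:_][A-Za-z0-9:_.-]*")


def is_name(candidate):
    """True if `candidate` is a legal name under the ASCII-only rule."""
    return NAME.fullmatch(candidate) is not None


samples = ("foo", "foo1234", "sup2", "frac14", "xsl:template",
           "my-tag", "_x.y", "1foo", "-foo", ".foo", "")
for candidate in samples:
    print(repr(candidate), is_name(candidate))
```

Digits, dash and dot are accepted anywhere except the first position, so foo1234, sup2 and frac14 all pass while 1foo does not. A leading or trailing '/' on a tag would still have to be handled outside this rule.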


cheers

andrew

