On Sat, Jul 10, 2010 at 11:10 AM, Sausset François <[email protected]> wrote:
> I just saw that when looking at the code by myself.
> What exactly do you mean by a prefix tree?

http://en.wikipedia.org/wiki/Trie

> I also noticed that the entity parser does not take into account combined
> Unicode characters (see §A.3 in: http://www.w3.org/TR/xml-entity-names/).
> In addition, even without entities, combined characters are displayed as
> separate ones.

My understanding is that this is the correct behavior w.r.t. the HTML5
specification of entity parsing. Our entity processing aims for perfect
compliance with this algorithm:

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#tokenizing-character-references

My belief is that the only things we're missing for perfect compliance are
the expanded list of entity names:

http://www.whatwg.org/specs/web-apps/current-work/multipage/named-character-references.html#named-character-references

and the prefix tree.

Adam

> On Jul 10, 2010, at 21:00, Adam Barth wrote:
> Implementing MathML entities is not as easy as adding them to
> HTMLEntityNames.gperf. The problem is that our entity parsing code (both
> the legacy entity parser and the new HTML5 one we're using) assumes
> that all named entities are <= 8 characters:
>
> http://trac.webkit.org/browser/trunk/WebCore/html/HTMLEntityParser.cpp#L194
>
> Rather than just bumping up that number, we need to change the data
> structure we use to store entities. Instead of a perfect hash, we
> should use a prefix tree. In order to parse entities correctly
> according to the spec, we need to know whether a given string is a
> prefix of a named entity, which is what the prefix tree would tell us.
>
> Adam
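
To make the prefix tree idea concrete, here is a minimal sketch in C++17. It is not WebKit code: EntityTrie, isPrefix and longestMatch are hypothetical names chosen for illustration, and the sketch assumes entity names are stored without the leading '&' and with the trailing ';' where the name has one. It shows the two questions the parser needs answered: "is this string a prefix of some entity name?" and "what is the longest entity name the input starts with?"

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <optional>
#include <string_view>

// Illustrative prefix tree (trie) for named entities.
// All names here are hypothetical, not WebKit's actual API.
class EntityTrie {
public:
    // Register an entity name (without the leading '&') and its code point.
    void insert(std::string_view name, char32_t codePoint)
    {
        assert(!name.empty());
        Node* node = &m_root;
        for (char c : name) {
            auto& child = node->children[c];
            if (!child)
                child = std::make_unique<Node>();
            node = child.get();
        }
        node->value = codePoint;
    }

    // True if at least one registered entity name starts with `text`.
    bool isPrefix(std::string_view text) const
    {
        return walk(text) != nullptr;
    }

    struct Match {
        std::size_t length;  // number of characters consumed from the input
        char32_t codePoint;  // code point of the matched entity
    };

    // Longest registered entity name that `input` starts with, if any.
    std::optional<Match> longestMatch(std::string_view input) const
    {
        std::optional<Match> best;
        const Node* node = &m_root;
        for (std::size_t i = 0; i < input.size(); ++i) {
            auto it = node->children.find(input[i]);
            if (it == node->children.end())
                break;  // no entity name continues with this character
            node = it->second.get();
            if (node->value)
                best = Match{i + 1, *node->value};
        }
        return best;
    }

private:
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        std::optional<char32_t> value;  // set only where an entity name ends
    };

    // Follow `text` from the root; nullptr if the path does not exist.
    const Node* walk(std::string_view text) const
    {
        const Node* node = &m_root;
        for (char c : text) {
            auto it = node->children.find(c);
            if (it == node->children.end())
                return nullptr;
            node = it->second.get();
        }
        return node;
    }

    Node m_root;
};
```

For example, with "not" (U+00AC) and "notin;" (U+2209) inserted, longestMatch("notit;") consumes only the three characters of "not", while longestMatch("notin;") consumes all six. Nothing in the structure limits the length of a name, so long MathML names need no special handling. A std::map per node keeps the sketch short; a real implementation would probably want a more compact, statically generated table.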

