On Sat, 2009-04-25 at 21:19 +0800, Bo Yang wrote:
> 2. Which type of string should we intern?
> According to 1, I propose to intern strings selectively. We should use
> lwc_string to store the strings which appear multiple times and
> which need to be compared frequently. Specifically, I mean HTML tag
> names, attribute names, and enum-like attribute values (such as attribute
> "display"'s values, which include "inline", "block", "inline-block"...).
> The id attribute value and the class attribute value are
> also good candidates for interning.
I think it will be simpler to just intern all tag names, attribute names
and attribute values.
Firstly, it avoids the additional complexity caused by having to parse
the attribute name to work out whether to intern the value or not.
Secondly, it makes all attribute values consistent, so you don't have to
litter everything with special cases.
Thirdly, I don't think the overhead of interning non-enumerated
attribute values will be noticeable and, if it does turn out to be, we
can consider the best way to deal with it _then_ rather than attempting
to optimise for this case now, when we've no real idea if it's a
worthwhile use of time & effort.
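
To illustrate why interning everything keeps comparisons uniform, here is
a minimal sketch of a toy interner. Every name in it (toy_interned,
toy_intern, toy_equal, TOY_BUCKETS) is hypothetical and made up for this
example; it is not the lwc_string API and it leaves out reference
counting and cleanup entirely. It only shows the general technique: once
every string goes through the same table, equality becomes a pointer
comparison, whatever kind of attribute the string came from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy interner for illustration; NOT the lwc_string API. */
typedef struct toy_interned {
	struct toy_interned *next;	/* Next entry in the same bucket */
	size_t len;			/* Byte length, excluding the NUL */
	char data[];			/* NUL-terminated copy of the string */
} toy_interned;

#define TOY_BUCKETS 64
static toy_interned *toy_table[TOY_BUCKETS];

/* Simple multiplicative hash over the bytes, reduced to a bucket index */
size_t toy_hash(const char *s, size_t len)
{
	size_t h = 5381;
	for (size_t i = 0; i < len; i++)
		h = h * 33 + (unsigned char) s[i];
	return h % TOY_BUCKETS;
}

/* Return the unique interned copy of s, or NULL if allocation fails */
const toy_interned *toy_intern(const char *s, size_t len)
{
	assert(s != NULL);

	size_t b = toy_hash(s, len);

	/* Reuse an existing entry if the same bytes were interned before */
	for (toy_interned *e = toy_table[b]; e != NULL; e = e->next) {
		if (e->len == len && memcmp(e->data, s, len) == 0)
			return e;
	}

	toy_interned *e = malloc(sizeof(*e) + len + 1);
	if (e == NULL)
		return NULL;

	e->len = len;
	memcpy(e->data, s, len);
	e->data[len] = '\0';
	e->next = toy_table[b];
	toy_table[b] = e;

	return e;
}

/* Interned strings are equal exactly when their pointers are equal */
bool toy_equal(const toy_interned *a, const toy_interned *b)
{
	return a == b;
}
```

With this, interning "class" twice yields the same pointer, so
toy_equal() is true without looking at the bytes, and no caller needs to
know whether a given value was "enum-like" or not.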
> 3. When to intern strings?
Yeah, hubbub will intern strings.
You're right that the libxml binding needs to perform the interning.
> 4. What does the dom_string look like?
> It should be changed to:
> struct dom_string {
> void *ptr; /**< Pointer to string data when len >= 0, else
>             *   pointer to the lwc_string which contains the
>             *   real string */
>
> int len; /**< Byte length of string when len >= 0; if
>           *   len == -1, the dom_string is interned in ptr */
I don't like the use of a magic value here.
> dom_alloc alloc; /**< Memory (de)allocation function */
> void *pw; /**< Client-specific data */
>
> uint32_t refcnt; /**< Reference count */
> };
>
> I hope the above comments explain my intent well. The member len takes
> on two responsibilities now. When it equals -1, it means that ptr
> points to a lwc_string; otherwise, if len >= 0, it means ptr
> points to a real string buffer. So, our dom_string comparison code
> will look like:
One idea that Daniel had was that we shouldn't make dom_string wrap two
different underlying types and should just have two, distinct, string
types -- one for the structural stuff, and one for textual content.
I've not yet thought through the implications of this.
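
As a rough sketch of what the two-type idea could look like (purely
illustrative; the names sketch_name and sketch_text and their fields are
assumptions made up for this example, not an agreed libdom design), each
type carries exactly one representation, so no magic length value is
needed to tell them apart:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type for structural strings (tag and attribute names).
 * "interned" is an opaque stand-in for a handle to an interned string. */
typedef struct sketch_name {
	const void *interned;	/* Unique handle from the interning layer */
	uint32_t refcnt;	/* Reference count */
} sketch_name;

/* Hypothetical type for textual content, held as plain UTF-8 bytes */
typedef struct sketch_text {
	const uint8_t *data;	/* UTF-8 string data */
	size_t len;		/* Byte length of data */
	uint32_t refcnt;	/* Reference count */
} sketch_text;

/* Structural strings compare by handle identity */
bool sketch_name_equal(const sketch_name *a, const sketch_name *b)
{
	assert(a != NULL && b != NULL);
	return a->interned == b->interned;
}

/* Textual content compares by length, then byte by byte */
bool sketch_text_equal(const sketch_text *a, const sketch_text *b)
{
	assert(a != NULL && b != NULL);
	if (a->len != b->len)
		return false;
	return a->len == 0 || memcmp(a->data, b->data, a->len) == 0;
}
```

The trade-off is that any API which currently takes one string type would
have to choose between the two, which is part of what still needs
thinking through.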
> 5. Some more consideration...
> When I think about the strings, I also wonder how we should store a
> character. Now, we use uint8_t in libDOM, but I think we should use
> UTF-16 encoding in DOM, and use uint16_t to replace uint8_t.
Please don't do this. LibDOM (and, indeed, the entire rest of NetSurf)
is designed to operate on UTF-8 string data. It's far too late to
consider using UTF-16 instead.
J.