The dom_string design

Bo Yang Sat, 25 Apr 2009 06:20:12 -0700

Hi,
    After some thought about how to integrate lwc_string into libDOM, I
have finally worked out an approach. My proposal follows; any advice is
welcome.


1. Why do we intern strings using lwc_string?
To save space and to make string comparison efficient, of course. But
interning has its own cost, so I think interning every string in a
webpage (or s/dom_string/lwc_string/g) is not a good idea. A page may
contain hundreds or thousands of strings; interning all of them would
cause more collisions in the hash table and waste a lot of memory.
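
To make the trade-off concrete, here is a minimal sketch of a toy
interner. It is not the libwapcaplet API; every toy_* name is made up
for illustration. It shows both the benefit (equality becomes a pointer
comparison) and the cost (each distinct string occupies a hash-table
entry that has to be searched on every intern call).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy interner for illustration; NOT the libwapcaplet API. */
#define TOY_BUCKETS 64

typedef struct toy_istring {
    struct toy_istring *next;  /* next entry in the same hash bucket */
    size_t len;                /* byte length of data, excluding the NUL */
    char data[];               /* the string bytes, NUL-terminated */
} toy_istring;

static toy_istring *toy_table[TOY_BUCKETS];

/* 32-bit FNV-1a hash over the bytes of the string. */
static uint32_t toy_hash(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Return the unique entry for s, creating it on first use (NULL on OOM). */
static const toy_istring *toy_intern(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    toy_istring **bucket = &toy_table[toy_hash(s, len) % TOY_BUCKETS];

    /* The more strings are interned, the longer these chains become. */
    for (toy_istring *e = *bucket; e != NULL; e = e->next) {
        if (e->len == len && memcmp(e->data, s, len) == 0)
            return e;
    }

    toy_istring *e = malloc(sizeof(*e) + len + 1);
    if (e == NULL)
        return NULL;
    e->len = len;
    memcpy(e->data, s, len + 1);
    e->next = *bucket;
    *bucket = e;
    return e;
}

/* Two interned strings are equal exactly when they are the same entry. */
static bool toy_isequal(const toy_istring *a, const toy_istring *b)
{
    return a == b;
}
```

Interning "div" twice yields the same pointer, so comparing two tag
names never touches the bytes. A long, unique text node would only add
an entry that is never compared against anything.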

2. Which types of string should we intern?
Following from 1, I propose interning strings selectively. We should use
lwc_string to store strings which appear many times and which need to
be compared frequently. In general, I mean HTML tag names, attribute
names, and enum-like attribute values (such as attribute
"display"'s values, which include "inline", "block", "inline-block"...).
The id and class attribute values are also good candidates for
interning.

3. When to intern strings?
I think the best time to create the lwc_string is while the webpage is
being parsed. The hubbub parser should create an lwc_string when it comes
across one of the above types of string. I propose this because, if
hubbub does not create the lwc_string, we have to create one in libDOM,
and that requires scanning the string twice (once when the page is
parsed in hubbub and once in libDOM for interning), which is of course
inefficient.
For the XML parser, the only option left is to intern strings in
libDOM, that is, in the callbacks of the libDOM binding.

4. What does the dom_string look like?
I propose a small change.

The original is:
struct dom_string {
        uint8_t *ptr;                   /**< Pointer to string data */

        size_t len;                     /**< Byte length of string */

        dom_alloc alloc;                /**< Memory (de)allocation function */
        void *pw;                       /**< Client-specific data */

        uint32_t refcnt;                /**< Reference count */
};

It should be changed to:
struct dom_string {
        void *ptr;                      /**< Pointer to string data when
                                         *   len >= 0, else pointer to the
                                         *   lwc_string which contains the
                                         *   real string */

        int len;                        /**< Byte length of string when
                                         *   len >= 0; len == -1 means the
                                         *   string is interned in ptr */

        dom_alloc alloc;                /**< Memory (de)allocation function */
        void *pw;                       /**< Client-specific data */

        uint32_t refcnt;                /**< Reference count */
};

I hope the comments above explain my intent. The member len now has
two responsibilities. When it equals -1, ptr points to an lwc_string;
otherwise, when len >= 0, ptr points to a real string buffer. So our
dom_string comparison code will look like:

bool cmp(dom_string *str1, dom_string *str2)
{
    /* Both interned: compare the lwc_strings directly */
    if (str1->len == -1 && str2->len == -1)
        return lwc_string_compare((lwc_string *)str1->ptr,
                                  (lwc_string *)str2->ptr);

    /* Otherwise get the raw bytes behind each string */
    uint8_t *s1 = (uint8_t *)str1->ptr, *s2 = (uint8_t *)str2->ptr;
    if (str1->len == -1)
        s1 = STR_OF(str1->ptr);
    if (str2->len == -1)
        s2 = STR_OF(str2->ptr);

    // Compare the two strings
}
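
For anyone who wants to try the idea in isolation, the following is a
self-contained sketch of the same comparison. It uses a stand-in
struct in place of lwc_string and leaves out the alloc, pw and refcnt
members, so every sketch_* name is an assumption for illustration
rather than libDOM or libwapcaplet code. It also assumes the interner
hands out exactly one object per distinct string, which is what makes
the pointer comparison valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for lwc_string (hypothetical; the real type lives in libwapcaplet). */
typedef struct sketch_interned {
    const uint8_t *data;  /* string bytes */
    size_t len;           /* byte length */
} sketch_interned;

/* Reduced dom_string: only the two members the comparison needs. */
typedef struct sketch_dom_string {
    const void *ptr;  /* bytes when len >= 0, sketch_interned * when len == -1 */
    int len;          /* byte length, or -1 when the string is interned */
} sketch_dom_string;

/* Raw bytes of the string, whichever representation is in use. */
static const uint8_t *sketch_data(const sketch_dom_string *s)
{
    if (s->len == -1)
        return ((const sketch_interned *)s->ptr)->data;
    return s->ptr;
}

/* Byte length of the string, whichever representation is in use. */
static size_t sketch_len(const sketch_dom_string *s)
{
    if (s->len == -1)
        return ((const sketch_interned *)s->ptr)->len;
    return (size_t)s->len;
}

static bool sketch_isequal(const sketch_dom_string *a,
                           const sketch_dom_string *b)
{
    assert(a != NULL && b != NULL);

    /* Fast path: both interned, so identity of the interned object decides. */
    if (a->len == -1 && b->len == -1)
        return a->ptr == b->ptr;

    /* Slow path: at least one side is a plain buffer, compare the bytes. */
    size_t la = sketch_len(a), lb = sketch_len(b);
    if (la != lb)
        return false;
    return la == 0 || memcmp(sketch_data(a), sketch_data(b), la) == 0;
}
```

The mixed case (one interned, one not) still needs a byte comparison,
so the fast path only pays off when both operands come from the
interned categories listed in section 2.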

5. Some more considerations...
While thinking about strings, I also wondered how we should store a
character. At present we use uint8_t in libDOM, but I think we should use
the UTF-16 encoding in DOM, and replace uint8_t with uint16_t.
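
To get a feel for what the switch would mean for storage, here is a
small sketch that counts how many uint16_t units a UTF-8 buffer would
need in UTF-16. The function name is hypothetical and the code assumes
the input is well-formed UTF-8; it does no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the UTF-16 code units needed to hold a well-formed UTF-8 buffer.
 * Hypothetical helper for illustration; malformed input is not detected. */
static size_t utf16_units_for_utf8(const uint8_t *s, size_t len)
{
    assert(s != NULL || len == 0);

    size_t units = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = s[i];
        if ((b & 0xC0) == 0x80)
            continue;  /* continuation byte, already counted with its lead byte */
        /* A 4-byte UTF-8 sequence (lead byte >= 0xF0) becomes a surrogate
         * pair in UTF-16; every shorter sequence fits in one code unit. */
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

An ASCII string such as "div" takes 3 bytes as uint8_t but 3 units,
that is 6 bytes, as uint16_t, so markup-heavy pages would roughly
double their string storage in exchange for matching the encoding the
DOM specification uses for DOMString.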


I hope I have expressed my idea clearly. If anything is confusing,
please send me your questions. Any criticism and advice will be
much appreciated. Thanks!

Regards!
Bo
