On Sat, 2009-04-25 at 21:19 +0800, Bo Yang wrote:
> There may be hundreds and thousands of strings, intern all these
> will cause more collision in the hashtable and waste much memory.

Indeed.

> 2. Which type of string should we intern?
> According 1, I propose to intern strings optionally. We should use
> lwc_string to restore the strings, which appears multiple times and
> which are necessary to compare frequently. Generally, I mean HTML tag
> name, attribute name, enum-like attribute value (such as attribute
> "display"'s value which include "inline", "block", "inline-block"...).
> And the id attribute value as well as the class attribute value are
> also good candidates for intern.

I wouldn't say "optionally". In order for libcss, hubbub and libdom to
work sensibly together, the attribute names, tag names, and many of the
attribute values *MUST* be interned for rapid selection, tree
processing etc.

Thus at minimum, I'd say tag names, attribute names and attribute
values should be interned. The wapcaplet context can always be built to
have a larger hash chain count if it turns out to be an overhead we
can't ignore.

However, I don't think CDATA and the like need to be interned at all.

> 3. When to intern strings?
> I think the best time to import lwc_string is when the webpage is
> being parsed. The hubbub parser should create lwc_string when it come
> across the above types of strings. I propose this way because, if the
> hubbub did not create a lwc_string, we should create one in the libDOM
> and this require two times of string scanning (one time when page get
> parsed in hubbub and one time in libDOM for interning) and of course
> not efficient.
> For the XML parser, the only way left for us is to intern string in
> libDOM. I mean, in the callbacks of libDOM binding.

Certainly hubbub will be interning strings which ought to be interned
as it goes. As for the libxml binding, it'd be the responsibility of
the binding to intern them before giving them to libdom.

All libdom will need to do is ensure it increases the refcount on an
lwc_string if it stores a pointer to it in a struct.
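
As a rough sketch of that ownership rule, here is a self-contained toy
interner. It is not wapcaplet's actual API: the names istr, istr_intern,
istr_ref, istr_unref, istr_equal and element are all made up for
illustration, and the fixed bucket count stands in for the configurable
hash chain count. The point is only that interned strings compare by
pointer, and that a struct which stores one holds its own reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical interned string; a stand-in, NOT the real lwc_string. */
typedef struct istr {
    struct istr *next;  /* next entry in the same hash chain */
    size_t refcnt;      /* number of live references */
    size_t len;         /* length in bytes, excluding the NUL */
    char data[];        /* string bytes, NUL-terminated */
} istr;

#define NBUCKETS 64     /* assumed fixed; a real table would be tunable */
static istr *buckets[NBUCKETS];

/* FNV-1a-style hash reduced to a bucket index. */
static size_t hash_bytes(const char *s, size_t len)
{
    size_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h % NBUCKETS;
}

/* Return the unique istr for these bytes, taking one reference.
 * Returns NULL if memory runs out. */
istr *istr_intern(const char *s, size_t len)
{
    size_t b = hash_bytes(s, len);
    for (istr *e = buckets[b]; e != NULL; e = e->next) {
        if (e->len == len && memcmp(e->data, s, len) == 0) {
            e->refcnt++;
            return e;
        }
    }
    istr *e = malloc(sizeof(*e) + len + 1);
    if (e == NULL)
        return NULL;
    memcpy(e->data, s, len);
    e->data[len] = '\0';
    e->len = len;
    e->refcnt = 1;
    e->next = buckets[b];
    buckets[b] = e;
    return e;
}

/* Take an additional reference. */
istr *istr_ref(istr *s)
{
    s->refcnt++;
    return s;
}

/* Drop a reference; unlink and free the string when none remain. */
void istr_unref(istr *s)
{
    assert(s->refcnt > 0);
    if (--s->refcnt > 0)
        return;
    istr **p = &buckets[hash_bytes(s->data, s->len)];
    while (*p != s)
        p = &(*p)->next;
    *p = s->next;
    free(s);
}

/* Interned strings are equal exactly when their pointers are equal. */
bool istr_equal(const istr *a, const istr *b)
{
    return a == b;
}

/* Hypothetical container: storing the pointer means owning a reference. */
typedef struct element {
    istr *tag_name;
} element;

element *element_create(istr *tag_name)
{
    element *el = malloc(sizeof(*el));
    if (el == NULL)
        return NULL;
    el->tag_name = istr_ref(tag_name);  /* container takes its own ref */
    return el;
}

void element_destroy(element *el)
{
    istr_unref(el->tag_name);           /* release the container's ref */
    free(el);
}
```

With this shape, the parser or binding interns "div" once, hands the
pointer to element_create(), and can then drop its own reference; the
element keeps the string alive until element_destroy().
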
(and decrease the refcount when it frees the container, obviously).

> 4. What does the dom_string look like?
> I propose, Change a little:

Rather than change a little, my counterproposal is to change a lot.
Refactoring code doesn't take a vast amount of time, and by ensuring we
hit *every* use of a string, we can be sure we've considered everything
in libdom appropriately.

Thus I propose:

Anywhere dom_string is currently used for tag names, attribute names or
attribute values, it is changed to use lwc_string directly.

Anywhere dom_string is currently used for CDATA and the like, it is
changed to dom_cdata_string (whose structure is identical to the
current dom_string).

We remove dom_string entirely.

This means we will catch everything in one fell swoop. It'll be painful
for a couple of days, but in the long run will be superior.

> 5. Some more consideration...
> When I think about the strings, I also suspect how we store a
> character? Now, we use uint8_t in libDOM, but I think we should use
> UTF-16 encoding in DOM. And use uint16_t to replace uint8_t.

Everything in the new libraries is based around UTF-8. Also, UTF-8
makes sense, where UTF-16 just feels like a kludge on top of
Microsoft's insane fuckup wrt. their resource strings.

> I hope I have expressed my idea clearly, if anybody get confused by
> me, please shot any question to me. Any criticism and advice will be
> appreciated very much! Thanks!

You were clear, but I fear slightly misguided. Remember, a big refactor
now can reduce effort in the long run. Don't be afraid to change the
API. Until the first release of libdom, the API should be considered
entirely fluid and subject to change so that we can get it to be as
right as possible. Once we start integrating it into NetSurf proper, it
will be much, much harder to change.

Regards,

Daniel.

-- 
Daniel Silverstone                         http://www.digital-scurf.org/
PGP mail accepted and encouraged.            Key Id: 2BC8 4016 2068 7895
