We use a lot of wide character strings in Chrome, and a lot of wide
character constants.  In most cases this isn't really necessary; it's
just that we went wide everywhere in the beginning for simplicity /
consistency.

There are a few problems:

- On Linux and Mac, wchar_t is 4 bytes per character, which leads to a
lot of wasted memory.
- Even on Windows, wchar_t is 2 bytes, doubling the size for a lot of
strings that are ASCII.
- On Linux, wchar_t is not really used, and is just there to conform
to POSIX, so it's sort of abnormal to be using it.
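
To put rough numbers on the first two points, here's a trivial sketch
(nothing Chrome-specific, just standard C++):

// The same ASCII text costs 1 byte per character in a UTF8 std::string,
// 2 per character as wchar_t on Windows, and 4 per character as wchar_t
// on Linux / Mac (ignoring the fixed std::string / std::wstring overhead).
#include <cstdio>
#include <string>

int main() {
  std::string narrow = "http://www.google.com/";
  std::wstring wide = L"http://www.google.com/";
  printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));  // 2 on Windows, 4 elsewhere
  printf("narrow payload  = %zu bytes\n", narrow.size());
  printf("wide payload    = %zu bytes\n", wide.size() * sizeof(wchar_t));
  return 0;
}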

We've solved some of these problems by hiding the underlying string
type, as FilePath does.  Others have been solved with overloaded
functions, templates, ifdefs, etc.  We've already converted a few bits
over to ASCII (StatsTable).
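
For reference, the FilePath trick is roughly the following (a
simplified sketch of the idea, not the exact Chromium class):

// The string type is only wide on the platform that actually needs it;
// OS_WIN is our usual build define.
#include <string>

class FilePath {
 public:
#if defined(OS_WIN)
  typedef std::wstring StringType;  // Windows path APIs want UTF16
#else
  typedef std::string StringType;   // POSIX paths are just bytes
#endif

  explicit FilePath(const StringType& path) : path_(path) {}
  const StringType& value() const { return path_; }

 private:
  StringType path_;
};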

I am not suggesting we stop using wide; I just wanted some opinions.
I feel like wide is often used in situations where it would be
simpler, more efficient, and more portable to use a UTF8 string.  In
the cases where it needs to go wide at some Windows API, it is easy to
UTF8ToWide.  If there are concerns about UTF8 conversion performance,
I think we'll find out it's not a problem.  If it does turn out to be,
I have some ideas about how we could make it faster for most of these
cases.
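
To make the boundary conversion concrete, here's roughly what I mean.
This is just a sketch: in our tree the real helper is UTF8ToWide(),
and SetWindowText is only a stand-in for whichever Win32 call we
actually hit.

#include <windows.h>
#include <string>

// Minimal stand-alone equivalent of our UTF8ToWide() helper.
std::wstring Utf8ToWide(const std::string& utf8) {
  if (utf8.empty())
    return std::wstring();
  int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                static_cast<int>(utf8.size()), NULL, 0);
  std::wstring wide(len, 0);
  MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                      static_cast<int>(utf8.size()), &wide[0], len);
  return wide;
}

void SetTitle(HWND hwnd, const std::string& title_utf8) {
  // The string stays UTF8 everywhere internally; we only widen right at
  // the API boundary.
  SetWindowTextW(hwnd, Utf8ToWide(title_utf8).c_str());
}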

I think a good example is chrome_constants.cc; I don't see why any of
these should have to be wide.  Some of them may make their way into
filenames, etc., in which case they could easily be converted or
handled directly by FilePath methods.

Another good example would be chrome_switches; these will always be ASCII.
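
For instance, a narrow definition is all either of these files really
needs (the names below are made up for illustration, not the actual
constants):

// chrome_constants.cc could just use plain char:
namespace chrome {
const char kExampleExecutableName[] = "chrome.exe";  // illustrative name
}  // namespace chrome

// ...and so could chrome_switches.cc:
namespace switches {
const char kExampleDisableFoo[] = "disable-foo";  // illustrative name
}  // namespace switches

Anything that ends up in a filename can be wrapped in a FilePath (or
widened) at that point.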

So far I just mentioned constants, but I think this also applies to a
lot of other parts of our code, and it makes sense to shift to UTF8 in
a lot of our internal representations.  It would be an interesting
experiment to see how much of our memory usage on the browser side is
strings, and how much of that is strings that could be represented in
7 bits per character.

Just wanted to solicit thoughts, and make sure there is some sort of
agreement and support if we start trying to UTF8 some pieces of
Chrome.

Also, when this has come up before, I've heard the argument that this
means strings are no longer directly indexable (i.e. blah[3] gets you
the 4th character).  Well, this isn't true for wchar_t on Windows
either.  Since it is UTF16, it can have surrogate pairs (Unicode code
points currently go up to U+10FFFF, which is 21 bits).  It depends what
operations you need to do, but most are UTF8 / UTF16 safe even using
direct indexing.  Things like finding a substring, splitting on a
known character / substring, etc. are all safe on UTF8 strings, and the
encoding was intentionally designed this way.  If we have places where
we're directly indexing wchar_t and that would break in UTF8, that
means it was actually incorrect to begin with.  These cases should be
using proper ICU iterators.  So in moving to UTF8, we might actually
uncover some bugs, but anything that would break was broken before.
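
A quick sketch of both halves of that argument.  Byte-level searching
is safe because UTF8 is self-synchronizing: no character's encoding
ever appears in the middle of another character's encoding, so find()
can't produce a bogus match.  Real per-character iteration should go
through ICU (e.g. U8_NEXT from unicode/utf8.h), not operator[]:

#include <unicode/utf8.h>
#include <cstdint>
#include <cstdio>
#include <string>

int main() {
  std::string s = "na\xC3\xAFve caf\xC3\xA9";  // "naïve café" in UTF8

  // Safe: this matches the real trailing word, never a partial character,
  // because the lead and continuation bytes can't line up mid-sequence.
  size_t pos = s.find("caf\xC3\xA9");
  printf("found at byte offset %zu\n", pos);

  // Per-character iteration: U8_NEXT advances past one full code point.
  const uint8_t* data = reinterpret_cast<const uint8_t*>(s.data());
  int32_t i = 0;
  int32_t length = static_cast<int32_t>(s.size());
  while (i < length) {
    UChar32 c;
    U8_NEXT(data, i, length, c);
    printf("U+%04X\n", static_cast<unsigned>(c));
  }
  return 0;
}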

Thanks
-- dean
