We use a lot of wide character strings in Chrome. We also use a lot of wide character constants in Chrome. In most cases this isn't really necessary; we just went wide everywhere in the beginning for simplicity and consistency.
There are a few problems:
 - On Linux and Mac, wchar_t is 4 bytes per character, which leads to a lot of wasted memory.
 - Even on Windows, where wchar_t is 2 bytes, we double the size of a lot of strings that are pure ASCII.
 - On Linux, wchar_t is not really used natively; it's mostly there to conform to POSIX, so it's sort of abnormal to be building on it.

We've solved some of these problems by hiding the underlying string type, as FilePath does. Others have been solved with overloaded functions, templates, ifdefs, etc. We've already converted a few bits over to ASCII (StatsTable).

I am not suggesting we stop using wide everywhere; I just wanted some opinions. I feel like wide is often used in situations where it would be simpler, more efficient, and more portable to use a UTF-8 string. In the cases where it needs to go wide at some Windows API, it is easy to call UTF8ToWide at the boundary. If there are concerns about UTF-8 conversion performance, I think we'll find it's not a problem; if it does show up to be, I have some ideas about how we could make it faster for most of these cases.

I think a good example is chrome_constants.cc: I don't see why any of these should have to be wide. Some of them may make their way into filenames, etc., in which case they could be easily converted or handled directly by FilePath methods. Another good example is chrome_switches; these will always be ASCII.

So far I've only mentioned constants, but I think this also applies to a lot of other parts of our code, and it makes sense to shift to UTF-8 in a lot of our internal representations. It would be an interesting experiment to see how much of our memory usage on the browser side is strings, and how much of that is strings that could be represented in 7 bits per character.

Just wanted to solicit thoughts, and make sure there is some sort of agreement and support if we start trying to UTF-8 some pieces of Chrome.
Also, when this has come up before, I've heard the argument that this means strings are no longer directly indexable (i.e. blah[3] gets you the 4th character). Well, this isn't true for wchar_t on Windows either: since it is UTF-16, it can have surrogate pairs (Unicode is currently defined for about 21 bits of code point space, which doesn't fit in 16 bits).

It depends what operations you need to do, but most are UTF-8 / UTF-16 safe even using direct indexing. Things like finding a substring, or splitting on a known character or substring, are all safe on UTF-8 strings; the encoding was intentionally designed this way. If we have places where we're directly indexing into wchar_t strings in a way that would break in UTF-8, that means the code was actually incorrect to begin with, and those cases should be using proper ICU iterators. So in moving to UTF-8, we might actually uncover some bugs, but anything that would break was broken before.

Thanks
-- dean

Chromium Developers mailing list: chromium-dev@googlegroups.com
View archives, change email options, or unsubscribe: http://groups.google.com/group/chromium-dev