[chromium-dev] Re: using string16
An angel loses its wings for each 00 byte in UTF-16. Is 'host' measured in base-2 or base-10?

Linus

On Tue, Feb 3, 2009 at 6:11 PM, Evan Martin wrote:
> [A bunch of the team met up today to hammer out some decisions.]
>
> In brief: for strings that are known to be Unicode (that is, not random
> byte strings read from a file), we will migrate towards using string16.
> This means all places we use wstring should be split into the
> appropriate types:
> - byte strings should be string or vectors of chars
> - paths should be FilePath
> - urls should be GURL
> - UI strings, etc. should be string16.
>
> string16 uses UTF-16 underneath. It's equivalent to wstring on Windows,
> but wstring involves 4-byte characters on Linux/Mac.
>
> Some important factors were:
> - we don't have too many strings in this category (with the huge
>   exception of WebKit), so memory usage isn't much of an issue
> - it's the native string type of Windows, Mac, and WebKit
> - we want it to be explicit (i.e. a compile error) when you
>   accidentally use a byte string in a place where we should know the
>   encoding (which std::string and UTF-8 don't allow)
> - we still use UTF-8 in some places (like the history full-text
>   database) where space is more of a concern

Chromium Developers mailing list: chromium-dev@googlegroups.com
View archives, change email options, or unsubscribe: http://groups.google.com/group/chromium-dev
[chromium-dev] Re: using string16
Hey Evan,

I apologize for missing this discussion; I'm sure that I'm not seeing the entire picture and the pros of this argument. I mentioned before that I'm in support of UTF-8 everywhere we can get it. We are obviously going to have platform-specific code for the UI (win32 / cocoa/objective-c / gtk), and it makes sense to use the native UI string type there. However, I think it should be possible for all "non-platform" common code and interfaces to be in UTF-8, and I feel like this would be a more logical design with equivalent performance.

I just wanted to point out a few concerns I have with using string16 in general.

- Another string type. It's already a bit confusing with WebKit strings, StringPiece, std::string, std::wstring, and string16. I feel like making the UI be string16 is going to prevent us from ever really pushing one string encoding everywhere.

- WebKit strings are not an argument for string16. We don't have to interact with WebKit from the UI, and we have a very nice interface there forced onto us by the IPC. So I don't think WebKit using UTF-16 is an argument for our UI code. WebKit's use of UTF-16 is forced by the JavaScript standard.

- std::wstring == string16, only on Windows. I think this will cause some confusion and likely a few bugs, where strings are improperly converted or confused between the two.

- You can't have string16 literals on Mac / Linux. On Windows, L"foo" will be a 16-bit string, making it fine as a std::wstring or string16. On Mac and Linux these will be 32-bit, unless we compile with -fshort-wchar, but I'm not sure that's a good idea. This means any string literals will need to be stored in another encoding (ascii, utf-8, wchar_t) and then converted to UTF-16. This isn't so strange until you think of what will happen on Linux, where we'd have utf-8 -> utf-16 -> utf-8 -> gtk.

- We don't have good library functions for string16. We have a lot of great things in string_util, and most operate on std::string / ascii / utf8 / std::wstring. We would need to add string16 versions of all of these (at least, it would be really nice to be able to use them).

- Memory / speed. You pointed out originally that this isn't a big deal, and that we don't have many UI strings. (This will later be my argument for why paying a utf-8 -> native conversion isn't a problem.) UTF-8 is a more concise encoding: for the very common ASCII case we save a byte per character, and for most non-ASCII cases it's probably a wash, since the UTF-8 encoding would usually take only 2 bytes. This also makes a difference in performance, since memory is a bottleneck and you have to deal with less of it. Probably not really worth evaluating in this setting, but I just wanted to point out that I feel UTF-8 is the superior encoding here.

I'm definitely looking forward to the other side of the picture, and why using string16 will make our UI code simpler on Mac and Linux.

On Wed, Feb 4, 2009 at 3:11 AM, Evan Martin wrote:
> In brief: for strings that are known to be Unicode (that is, not random
> byte strings read from a file), we will migrate towards using string16.
[chromium-dev] Re: using string16
On Wed, Feb 4, 2009 at 6:53 AM, Dean McNamee wrote:
> I apologize for missing this discussion, I'm sure that I'm not seeing
> the entire picture and the pros of this argument. I mentioned before
> that I'm in support of utf-8 everywhere we can get it.

I lost this argument, so I will defer this response to someone else. :)
[chromium-dev] Re: using string16
The proposal was to search-n-replace std::wstring to string16. We would have to invent a macro to replace L"" usage. Most usages of string literals are in unit tests, so it doesn't seem to matter if there is a cost associated with the macro.

My belief is that there isn't much fruit to be had by converting everything to UTF-8. I fear people passing non-UTF-8 strings around using std::string and the bugs that ensue from that. We've had those problems in areas that deal with UTF-8 and non-UTF-8 byte arrays. Whenever we have a string16 or a wstring, it implicitly means that we have Unicode that can be displayed to the user. So the compiler helps us not screw up.

If someone can make a compelling performance argument for changing Chrome's UI over to UTF-8 and also invent a solution that avoids the problem I described above, then converting to UTF-8 would seem OK to me. But right now... it just looks like cost for not much benefit.

-Darin

On Wed, Feb 4, 2009 at 8:21 AM, Evan Martin wrote:
> I lost this argument, so I will defer this response to someone else. :)
[chromium-dev] Re: using string16
On Wed, Feb 4, 2009 at 6:11 PM, Darin Fisher wrote:
> Whenever we have a string16 or a wstring, it means implicitly that we
> have unicode that can be displayed to the user. So, the compiler helps
> us not screw up.

This seems to be the only argument you make: that by making string16 a new type, we know its encoding. This can be solved in many other ways while keeping UTF-8. We can add a new utf8 string class if you really want, or we can just be diligent and make sure to DCHECK in methods that expect a specific encoding. Have we had a lot of these problems? Do you have some examples? It would help me figure out solutions for better checking for UTF-8.
[chromium-dev] Re: using string16
Trying to remember what came up along the discussion. UTF-16 is what Mac/Win use, so there we can avoid a batch of conversions on those two platforms. (Mac can take UTF-8, but the system would still be doing conversions to get things into a form it prefers.) Didn't someone say ICU needs things in 16-bit also, so every time we call one of those APIs we'd be round-tripping conversions if we went with UTF-8?

TVL

On Wed, Feb 4, 2009 at 12:35 PM, Dean McNamee wrote:
> This seems to be the only argument you make: that by making string16 a
> new type, we know its encoding. This can be solved in many other ways
> while keeping UTF-8.
[chromium-dev] Re: using string16
On Wed, Feb 4, 2009 at 9:35 AM, Dean McNamee wrote:
> Have we had a lot of these problems? Do you have some examples? It
> would help me figure out solutions for better checking for utf-8.

We have had a lot of these problems in the code that interfaces with WinHTTP and other networking code, where std::string is used to relay headers, which do not necessarily have a known encoding. I've also seen this kind of problem over and over again in the Mozilla code base.

I think we have much bigger fish to fry, so I'd need to hear a convincing argument about why investing time and energy in converting from UTF-16 to UTF-8 is a good idea.

-Darin
[chromium-dev] Re: using string16
The big string area is WebKit, of course. If WebKit were 100% UTF-8 already, we might take a different stance on this issue as well.

If it is our goal to get to UTF-8 everywhere, then laying the plumbing for utf8 strings rather than string16 strings seems like the right thing to do.

Mike

On Wed, Feb 4, 2009 at 9:52 AM, Darin Fisher wrote:
> I think we have much bigger fish to fry, so I'd need to hear a
> convincing argument about why investing time and energy in converting
> from UTF-16 to UTF-8 is a good idea.
[chromium-dev] Re: using string16
+1 to string16. I can't make performance or memory-saving claims with a straight face for either encoding. We just don't process enough strings for it to matter.

On Feb 4, 9:57 am, Mike Belshe wrote:
> The big string area is webkit, of course. If webkit were 100% UTF-8
> already, we might take a different stance on this issue as well.