[v8-users] Re: some minor string confusion :)

Pete Gontier Sun, 26 Oct 2008 13:29:33 -0700

On Oct 26, 2008, at 12:03 PM, Erik Corry wrote:

>  the default behavior will be to assume the encoding, UCS-2, which  
> is guaranteed to be free of surrogate pair subtleties.
>
> I don't understand what this could mean in practice.  If the input  
> contains only basic plane (16 bit characters) then there is no  
> difference between UCS-2 and UTF-16.  So in this case the flag would  
> make no difference.  If the input contains characters from the 20  
> bit space then UCS-2 can't represent them so what will you do with  
> them if the user specifies UCS-2 but has such characters.  I think  
> throwing them away would be worse than just leaving them in there as  
> surrogate pairs.  I suppose you could throw an exception but that  
> seems worse too.


I was planning to throw an exception.

Seems to me my choice here is between [1] doing nothing and allowing  
people to encounter subtle bugs in their own code and [2] being an  
annoying pedantic gatekeeper who forces people to explicitly request a  
potentially problematic situation. Neither option is perfect; the  
question is which is less bad.

The situation that concerns me most is a team writing a lot of code
which naively assumes JavaScript strings are UCS-2, because the team's
native language fits into UCS-2, and maybe the languages of their
neighbors fit into UCS-2 as well. By the time they realize their code
has subtle problems processing UTF-16 text, their investment in the
project is already too substantial for the problems to be fixed, so
they are forced, late in the development cycle, to abandon entire
markets.
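
To illustrate the kind of subtle problem meant here, the following is a small sketch, not taken from any real project. The helper name naiveReverse is made up for the example. It treats every 16-bit code unit as a whole character, which works for basic-plane text and silently corrupts anything encoded as a surrogate pair.

```javascript
// Hypothetical helper (not from the original message): reverses a string
// one 16-bit code unit at a time. This is only correct if every character
// fits in a single code unit, i.e. if the text is effectively UCS-2.
function naiveReverse(text) {
  var out = "";
  for (var i = text.length - 1; i >= 0; i--) {
    out += text.charAt(i);
  }
  return out;
}

// Basic-plane text: one code unit per character, so the result is correct.
var basicReversed = naiveReverse("abc"); // "cba"

// U+1D11E (musical symbol G clef) is stored as the surrogate pair
// D834 DD1E, so this three-character string has four code units.
var beyond = "a\uD834\uDD1Eb";

// The pair comes out in the wrong order (DD1E before D834), which is
// no longer valid UTF-16.
var brokenReversed = naiveReverse(beyond); // "b\uDD1E\uD834a"
```

Nothing fails loudly here: the code runs, and the damage only shows up when the text is displayed or re-encoded.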

The exception would be a big unmistakable warning the very first time  
they attempt to use input text which doesn't fit into UCS-2 -- perhaps  
without realizing it -- before the problem has a chance to become  
tricky to diagnose. Yes, they can explicitly accept UTF-16 to inhibit  
the exception, but they had better know the rest of their code can  
actually process it, and they had better understand that they can't  
expect the built-in string and regexp facilities to help with that.
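
A minimal sketch of such a check, written in JavaScript for illustration only. The function name checkInput and the allowUtf16 parameter are assumptions made for this example and are not part of V8's API. The idea is simply that any surrogate code unit (0xD800 through 0xDFFF) means the text does not fit into UCS-2.

```javascript
// Hypothetical gate (names are illustrative, not V8's API): rejects text
// containing surrogate code units unless the caller explicitly opts in
// to UTF-16.
function checkInput(text, allowUtf16) {
  if (allowUtf16) {
    return text; // caller has accepted responsibility for surrogate pairs
  }
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDFFF) {
      throw new RangeError(
        "input does not fit into UCS-2: surrogate code unit at index " + i);
    }
  }
  return text;
}

// Basic-plane text passes through untouched.
var accepted = checkInput("abc", false);

// Text outside the basic plane fails on first use, unless opted in.
var rejected = false;
try {
  checkInput("a\uD834\uDD1Eb", false);
} catch (e) {
  rejected = e instanceof RangeError;
}
var optedIn = checkInput("a\uD834\uDD1Eb", true);
```

The opt-in only suppresses the exception; it does nothing to make the rest of the code handle surrogate pairs correctly.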

In short, my hope would be that the exception makes it easier to  
discover earlier that UTF-16 is a huge issue.


– Pete Gontier <http://pete.gontier.org/>




