Re: charCodeAt for non-ASCII characters doesn't match browsers?

Attila Szegedi Sat, 18 Oct 2008 12:07:40 -0700

Okay, I extracted intelligent character encoding handling into a separate class used by both shell and jsc. I also added an -encoding switch to jsc, and renamed it from -enc to -encoding in shell too. The reason is that javac uses "-encoding" as well, so we keep it consistent.

Also, jsc and shell no longer choke on BOM either, so that also fixes bug 399347.


Attila.

On Oct 18, 2008, at 12:23 PM, Attila Szegedi wrote:

I added the "-enc" command line switch, as well as more intelligent handling of character encodings (including relying on Content-type when reading from URL as well as autodetection of various UTF formats as per RFC 4329 when reading from URL or local file), see:
<https://bugzilla.mozilla.org/show_bug.cgi?id=399347#c3>
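To illustrate what "relying on Content-type" could look like, here is a minimal sketch that pulls the charset parameter out of a Content-Type header value. It is not Rhino's code: the function name charsetFromContentType is hypothetical, and the parsing is deliberately simplified (it does not handle quoted values that themselves contain semicolons).

```javascript
// Hypothetical helper, not part of Rhino: extract the charset parameter
// from a Content-Type header value, or return null if there is none.
function charsetFromContentType(headerValue) {
  if (typeof headerValue !== "string") {
    return null;
  }
  // Everything after the first ";" is a list of key=value parameters.
  const params = headerValue.split(";").slice(1);
  for (const param of params) {
    const eq = param.indexOf("=");
    if (eq === -1) {
      continue;
    }
    const key = param.slice(0, eq).trim().toLowerCase();
    if (key !== "charset") {
      continue;
    }
    let value = param.slice(eq + 1).trim();
    // Strip one pair of surrounding double quotes, if present.
    if (value.length >= 2 && value.startsWith('"') && value.endsWith('"')) {
      value = value.slice(1, -1);
    }
    return value === "" ? null : value.toLowerCase();
  }
  return null;
}

console.log(charsetFromContentType("application/javascript; charset=UTF-8")); // utf-8
console.log(charsetFromContentType("text/javascript")); // null
```

A caller would typically fall back to some other source of encoding information when this returns null.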
A Rhino JAR now built from CVS HEAD correctly prints 225 when launched with "-enc utf-8":
MacBook-Ati:rhino aszegedi$ java -jar build/rhino1_7R2pre/js.jar -enc utf-8
Rhino 1.7 release 2 PRERELEASE 2008 10 18
js> print("á".charCodeAt(0))
225
js>
Actually, if your file is UTF-8, UTF-16, or UTF-32 encoded, and has a byte order mark at the beginning of the file, it'll be correctly decoded even without the "-enc" parameter.
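The byte order mark check can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that went into Rhino: the names BOM_SIGNATURES and detectBom are hypothetical, and the usage lines assume a runtime that provides TextDecoder (a modern browser or Node.js, not the 2008 Rhino shell). The 4-byte UTF-32 signatures are tested before the 2-byte UTF-16 ones because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.

```javascript
// Hypothetical table of BOM signatures; longer signatures come first so that
// UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
const BOM_SIGNATURES = [
  { encoding: "UTF-32BE", signature: [0x00, 0x00, 0xFE, 0xFF] },
  { encoding: "UTF-32LE", signature: [0xFF, 0xFE, 0x00, 0x00] },
  { encoding: "UTF-8", signature: [0xEF, 0xBB, 0xBF] },
  { encoding: "UTF-16BE", signature: [0xFE, 0xFF] },
  { encoding: "UTF-16LE", signature: [0xFF, 0xFE] },
];

// Returns { encoding, bomLength } if the input starts with a known BOM,
// otherwise null.
function detectBom(input) {
  for (const { encoding, signature } of BOM_SIGNATURES) {
    const matches =
      input.length >= signature.length &&
      signature.every((byte, i) => input[i] === byte);
    if (matches) {
      return { encoding, bomLength: signature.length };
    }
  }
  return null;
}

// Usage: a UTF-8 file containing a BOM followed by "á" (C3 A1).
const fileBytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0xC3, 0xA1]);
const bom = detectBom(fileBytes);
const text = new TextDecoder("utf-8").decode(fileBytes.subarray(bom.bomLength));
console.log(bom.encoding, text.charCodeAt(0)); // UTF-8 225
```

One caveat of this ordering: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE.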
Attila.

--
home: http://www.szegedi.org
weblog: http://constc.blogspot.com

On Oct 18, 2008, at 10:46 AM, Attila Szegedi wrote:
Well, I could reproduce this, and it seems to me to be a bug (at least as far as shell is concerned). If I write this code snippet:
print("á".charCodeAt(0))
into a file "x.js", save it with UTF-8 encoding and run it with Rhino using
java -jar js.jar x.js
it prints 8730. Turns out "á" is encoded as C3 A1, which is indeed UTF-8 for "á". However, java.lang.System.getProperty("file.encoding") returns "MacRoman", and C3 in MacRoman translates to the U+221A "SQUARE ROOT" character (decimal 8730). Same happens when directly typing it into the console.
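The mismatch can be reproduced outside Rhino with a small sketch that decodes the same two bytes both ways. It assumes a runtime with TextDecoder (a modern browser or Node.js). The MacRoman side uses a hypothetical, deliberately partial lookup table, MAC_ROMAN_SAMPLE, holding only the two bytes involved here, so it is not a general MacRoman decoder.

```javascript
// The two bytes a UTF-8 editor writes for "á" (U+00E1).
const sourceBytes = new Uint8Array([0xC3, 0xA1]);

// Intended interpretation: one two-byte UTF-8 sequence.
const asUtf8 = new TextDecoder("utf-8").decode(sourceBytes);

// Hypothetical partial MacRoman table: only the bytes needed for this example.
// 0xC3 maps to U+221A (SQUARE ROOT), 0xA1 maps to U+00B0 (DEGREE SIGN).
const MAC_ROMAN_SAMPLE = { 0xC3: 0x221A, 0xA1: 0x00B0 };

// Decodes each byte independently, as a single-byte encoding would.
function decodeMacRomanSample(input) {
  let out = "";
  for (const byte of input) {
    if (byte < 0x80) {
      out += String.fromCharCode(byte); // ASCII range is unchanged
    } else if (byte in MAC_ROMAN_SAMPLE) {
      out += String.fromCharCode(MAC_ROMAN_SAMPLE[byte]);
    } else {
      throw new RangeError("byte not in the sample table: " + byte);
    }
  }
  return out;
}

const asMacRoman = decodeMacRomanSample(sourceBytes);

console.log(asUtf8.charCodeAt(0)); // 225
console.log(asMacRoman.charCodeAt(0)); // 8730
console.log(asMacRoman.length); // 2
```

Read as MacRoman, the single character becomes two, and charCodeAt(0) sees only the first of them.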
So, there's a discrepancy between character encodings: console on Mac OS X apparently feeds the characters as UTF-8 encoded byte stream through System.in, but Rhino shell reads them as MacRoman, as that's the default Java encoding in the JRE (value of the "file.encoding" system property). Taken at face value, this is actually a bug in Java; if the console is UTF-8 based, the JRE should detect that, and set "file.encoding" to utf-8.
We could work around it if Rhino shell had an explicit command line encoding declaration, i.e. if you could specify "-c utf-8" -- that'd solve it.
Actually, I believe I'll just write code to solve this that'd be conformant to RFC 4329.
Attila.

--
home: http://www.szegedi.org
weblog: http://constc.blogspot.com

On Oct 18, 2008, at 12:31 AM, tlrobinson wrote:
In Rhino, if I do "á".charCodeAt(0) I get 8730, whereas in Firefox
and Safari I get 225. (That's option-"e" then "a" in OS X.)

Is this undefined behavior in JavaScript, a bug, or am I doing
something weird?

Thanks.


_______________________________________________
dev-tech-js-engine-rhino mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-tech-js-engine-rhino
