Okay, I extracted the intelligent character encoding handling into a
separate class used by both shell and jsc. I also added an -encoding
switch to jsc, and renamed the shell's switch from -enc to -encoding.
The reason is that javac uses "-encoding" as well, so this keeps things
consistent. Also, jsc and shell no longer choke on a BOM, which fixes
bug 399347.
Attila.
On Oct 18, 2008, at 12:23 PM, Attila Szegedi wrote:
I added the "-enc" command line switch, as well as more intelligent
handling of character encodings (including relying on Content-type
when reading from URL as well as autodetection of various UTF
formats as per RFC 4329 when reading from URL or local file), see:
<https://bugzilla.mozilla.org/show_bug.cgi?id=399347#c3>
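To illustrate the "relying on Content-type" part, here is a minimal sketch of pulling the charset parameter out of a Content-Type header value. It is not the code that went into Rhino; the class and method names (ContentTypeCharset, charsetOf) are made up for this example, and the parsing is deliberately simple.

```java
import java.util.Locale;

// Hypothetical helper, not part of Rhino: extracts the charset parameter
// from a Content-Type header value such as "text/javascript; charset=utf-8".
final class ContentTypeCharset {
    // Returns the charset parameter value, or null if the header has none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (name.equals("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

When this returns null, a caller would fall back to some other means of picking the encoding, such as BOM detection or an explicit command line switch.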
A Rhino JAR now built from CVS HEAD correctly prints 225 when
launched with "-enc utf-8":
MacBook-Ati:rhino aszegedi$ java -jar build/rhino1_7R2pre/js.jar -enc utf-8
Rhino 1.7 release 2 PRERELEASE 2008 10 18
js> print("á".charCodeAt(0))
225
js>
Actually, if your file is UTF-8, UTF-16, or UTF-32 encoded, and has a
byte order mark at the beginning of the file, it'll be correctly
decoded even without the "-enc" parameter.
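For the curious, BOM-based detection can be sketched roughly like this. This is an illustration using the standard Unicode BOM byte sequences, not the code committed to Rhino; BomSniffer and detect are names invented for the example.

```java
// Hypothetical helper, not part of Rhino: maps a leading byte order mark
// to a charset name.
final class BomSniffer {
    // Returns the charset name implied by a leading BOM, or null if none is found.
    static String detect(byte[] head) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        // Must be tested before UTF-16LE, whose BOM (FF FE) is a prefix of this one.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that FF FE 00 00 is ambiguous in principle (it could also be a UTF-16LE BOM followed by U+0000); the sketch simply prefers UTF-32LE in that case.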
Attila.
--
home: http://www.szegedi.org
weblog: http://constc.blogspot.com
On Oct 18, 2008, at 10:46 AM, Attila Szegedi wrote:
Well, I could reproduce this, and it seems to me to be a bug (at
least as far as shell is concerned). If I write this code snippet:
print("á".charCodeAt(0))
into a file "x.js", save it with UTF-8 encoding and run it with
Rhino using
java -jar js.jar x.js
it prints 8730. It turns out "á" is stored as the bytes C3 A1, which
is indeed the UTF-8 encoding of "á". However,
java.lang.System.getProperty("file.encoding") returns "MacRoman",
and C3 in MacRoman maps to the U+221A "SQUARE ROOT" character
(decimal 8730). The same happens when typing it directly into the
console.
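The mismatch can be reproduced outside Rhino with a few lines of Java. This is only a sketch: the class and method names are invented for the example, and it assumes the JDK's "x-MacRoman" charset is available, which may not be the case on minimal runtimes (hence the isSupported check).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes the same two bytes with two charsets.
final class EncodingMismatchDemo {
    // Decodes the bytes with the given charset and returns the UTF-16 code
    // unit at index 0, which is what JavaScript's charCodeAt(0) reports.
    static int firstCharCode(byte[] bytes, Charset charset) {
        return new String(bytes, charset).charAt(0);
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = {(byte) 0xC3, (byte) 0xA1}; // UTF-8 encoding of U+00E1
        System.out.println(firstCharCode(utf8Bytes, StandardCharsets.UTF_8));
        // "x-MacRoman" is the JDK's name for MacRoman (assumed to be installed).
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println(firstCharCode(utf8Bytes, Charset.forName("x-MacRoman")));
        }
    }
}
```

Decoded as UTF-8 the first code unit is 225; decoded as MacRoman it is 8730, matching what the shell printed.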
So, there's a discrepancy between character encodings: the console on
Mac OS X apparently feeds the characters as a UTF-8 encoded byte
stream through System.in, but the Rhino shell reads them as MacRoman,
as that's the default Java encoding in the JRE (the value of the
"file.encoding" system property). Taken at face value, this is
actually a bug in Java; if the console is UTF-8 based, the JRE
should detect that, and set "file.encoding" to utf-8.
We could work around it if Rhino shell had an explicit command line
encoding declaration, i.e. if you could specify "-c utf-8" --
that'd solve it.
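In Java terms, the workaround amounts to passing an explicit charset to the reader instead of relying on the platform default. A rough sketch, with invented names (ExplicitEncodingReader, readFirstLine) and no claim about how the shell will actually wire this up:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, not part of Rhino: reads text with a caller-chosen charset.
final class ExplicitEncodingReader {
    // Reads the first line from the stream, decoding with the named charset,
    // or with the JVM default charset when charsetName is null.
    static String readFirstLine(InputStream in, String charsetName) {
        Charset cs = charsetName != null
                ? Charset.forName(charsetName)
                : Charset.defaultCharset();
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, cs));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Called as readFirstLine(System.in, "utf-8"), this would decode console input as UTF-8 regardless of what "file.encoding" says.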
Actually, I believe I'll just write code to solve this that
conforms to RFC 4329.
Attila.
On Oct 18, 2008, at 12:31 AM, tlrobinson wrote:
In Rhino, if I do "á".charCodeAt(0) I get 8730, whereas in Firefox
and Safari I get 225. (That's option-"e" then "a" in OS X.)
Is this undefined behavior in JavaScript, a bug, or am I doing
something weird?
Thanks.
_______________________________________________
dev-tech-js-engine-rhino mailing list
https://lists.mozilla.org/listinfo/dev-tech-js-engine-rhino