So plan B: we emulate Ruby 1.9 string behavior on top of
NSString/NSData.
I'm really interested in this discussion too. A little background
for JRuby:
Thanks for the background, Charlie. This sort of history is very
instructive.
* Java's strings are all UTF-16. In order to represent binary data,
we ended up using a "raw" encoder/decoder and only using the bottom
byte of each character. Wasteful, since every string was 2x as
large, and slow, since IO had to up/downcast byte[] contents to
char[] and back.
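The raw encode/decode trick (and its 2x cost) can be mimicked in plain Ruby; this is a sketch for illustration, not JRuby's actual code, and the pack formats here are just one way to widen bytes into 16-bit code units:

```ruby
# Binary payload: all 256 byte values, tagged as raw bytes.
raw = (0..255).map(&:chr).join.force_encoding(Encoding::ASCII_8BIT)

# "Raw encode": widen each byte into one 16-bit code unit, the way a
# byte[] gets upcast into a Java char[]. Every string doubles in size,
# and only the bottom byte of each unit carries data.
wide = raw.unpack("C*").pack("v*").force_encoding(Encoding::UTF_16LE)
raise unless wide.bytesize == 2 * raw.bytesize

# "Raw decode": drop the high byte again to recover the original data.
back = wide.unpack("v*").pack("C*")
raise unless back == raw
```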
Most CFStrings use a UTF-16 internal store as well.
* We want to move to an intermediate version, where we sometimes
have a byte[]-backed string and sometimes a char[]/String-backed
string. IronRuby does this already. This is, however, predicated on
the idea that byte[]-based strings rarely become char[]-based
strings and vice versa. I don't have any evidence for or against
that yet.
So it's a nearly identical problem for MacRuby, as I understand it.
I'm interested in discussion around this topic, since we are still
moving forward with JRuby and would like to improve interop with
Java libraries. I will offer the following food for thought:
* Going with 100% objc strings at first is probably a good pragmatic
start. You'll have the perf/memory impact of encoding/decoding and
wasteful string contents, but you should be able to get it
functioning well. And since interop is a primary goal for MacRuby
(where it's been somewhat secondary in JRuby) this is probably a
better place to start.
That’s where things stand today, and with Laurent’s ByteString work
this all mostly works as long as you don’t try to change encodings
on strings.
* Alternatively, you could only support a minimum set of encodings
and make it explicit that internally everything would be UTF-16 or
MacRoman. In MacRuby's case, I think most people would happily
accept that, just as a lot of JRuby users would probably accept that
everything's UTF-16 since that's what they get from Java normally.
This seems like a bad situation in the face of the varied encoding
landscape on the Internet.
Ultimately this is the exact reason I argued over a year ago that
Ruby 1.9 should introduce a separate Bytes class used for IO. I was
denied.
I was disappointed to see this turned down as well. The encoding
situation in 1.9 feels worse than it was in 1.8, and that’s pretty
impressive.
It's definitely a sticky issue, and Ruby has made it even stickier
in 1.9 with arbitrary encoding support. None of the proposed
solutions across all implementations (including JRuby) have really
seemed ideal to me.
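For what it's worth, the rejected proposal is easy to sketch in plain Ruby. Everything here (the class name, its methods) is hypothetical, but it shows the shape of an IO-only byte vector that never pretends to be text:

```ruby
# Hypothetical Bytes class -- this does not exist in Ruby 1.9; it is
# a sketch of the kind of type Charlie argued IO should return.
class Bytes
  def initialize(str = "")
    # Internally just binary-tagged bytes; no character semantics.
    @data = str.dup.force_encoding(Encoding::ASCII_8BIT)
  end

  def size
    @data.bytesize
  end

  def [](i)
    @data.getbyte(i)
  end

  # Decoding is the only way to get a character String back out, and
  # it fails loudly instead of silently carrying a bogus encoding tag.
  def decode(encoding)
    s = @data.dup.force_encoding(encoding)
    raise ArgumentError, "not valid #{encoding}" unless s.valid_encoding?
    s
  end
end

b = Bytes.new("caf\u00E9")            # 5 bytes of UTF-8 data
raise unless b.size == 5 && b[0] == 0x63
raise unless b.decode(Encoding::UTF_8) == "caf\u00E9"
```

With a type like this, IO would hand back raw bytes and the programmer would opt into character semantics explicitly, rather than every String carrying an encoding it may or may not honor.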
Laurent and I discussed this a bit tonight, and here’s what I think we
can get away with:
* By default, store all strings as NSString (UTF-16 backed) with an
ivar to store the encoding.
* When getting bytes, convert to a ByteString in the appropriate
encoding.
* When doing force_encoding, convert to a ByteString in the old
encoding, then try to convert to an NSString in the new encoding. If
we succeed, great. If not, leave it as a tagged ByteString (and
probably whine about it).
* All ASCII-8BIT strings are backed by ByteString.
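As a sanity check on that flow, here's how the retag-vs-transcode distinction behaves in plain Ruby 1.9 semantics, which is what the ByteString/NSString split has to reproduce (a sketch only; ByteString itself is MacRuby-side):

```ruby
s = "caf\u00E9"                          # UTF-8: 4 characters, 5 bytes
raise unless s.bytesize == 5

# The ByteString path: force_encoding retags the same bytes unchanged.
b = s.dup.force_encoding(Encoding::ASCII_8BIT)
raise unless b.bytesize == 5

# If the bytes aren't valid in the target encoding, the string stays a
# tagged-but-invalid blob -- the "whine about it" case above.
bad = s.dup.force_encoding(Encoding::US_ASCII)
raise if bad.valid_encoding?

# The NSString path: encode actually transcodes, and the bytes change.
u16 = s.encode(Encoding::UTF_16LE)
raise unless u16.bytesize == 8           # 4 characters * 2 bytes each
```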
There’s some simplification here; some of the ByteStrings are really
just NSDatas, &c., but the flow is there. Sorry the list above is a
mess; I’m up much later than I’m accustomed to.
-Ben
_______________________________________________
MacRuby-devel mailing list
[email protected]
http://lists.macosforge.org/mailman/listinfo.cgi/macruby-devel