So plan B: we emulate Ruby 1.9 string behavior on top of
NSString/NSData.
I'm really interested in this discussion too. A little background
for JRuby:
Thanks for the background, Charlie. This sort of history is very
instructive.
* Java's strings are all UTF-16. In order to represent binary data,
we ended up using a "raw" encoder/decoder and only using the bottom
byte of each character. Wasteful, since every string was 2x as
large, and slow, since IO had to up/downcast byte[] contents to
char[] and back.
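The raw encode/decode trick (and its 2x cost) can be mimicked in plain Ruby; this is a sketch for illustration, not JRuby's actual code, and the pack formats here are just one way to widen bytes into 16-bit code units:

```ruby
# Binary payload: all 256 byte values, tagged as raw bytes.
raw = (0..255).map(&:chr).join.force_encoding(Encoding::ASCII_8BIT)

# "Raw encode": widen each byte into one 16-bit code unit, the way a
# byte[] gets upcast into a Java char[]. Every string doubles in size,
# and only the bottom byte of each unit carries data.
wide = raw.unpack("C*").pack("v*").force_encoding(Encoding::UTF_16LE)
raise unless wide.bytesize == 2 * raw.bytesize

# "Raw decode": drop the high byte again to recover the original data.
back = wide.unpack("v*").pack("C*")
raise unless back == raw
```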
Most CFStrings use a UTF-16 internal store as well.
* We want to move to an intermediate version, where we sometimes
have a byte[]-backed string and sometimes a char[]/String-backed
string. IronRuby does this already. This is, however, predicated on
the idea that byte[]-based strings rarely become char[]-based
strings and vice versa. I don't have any evidence for or against
that yet.
So it's a nearly identical problem for MacRuby, as I understand it.
I'm interested in discussion around this topic, since we are still
moving forward with JRuby and would like to improve interop with
Java libraries. I will offer the following food for thought:
* Going with 100% objc strings at first is probably a good pragmatic
start. You'll have the perf/memory impact of encoding/decoding and
wasteful string contents, but you should be able to get it
functioning well. And since interop is a primary goal for MacRuby
(where it's been somewhat secondary in JRuby) this is probably a
better place to start.
That’s where things stand today, and with Laurent’s ByteString work
this all mostly works as long as you don’t try to change encodings
on strings.
* Alternatively, you could only support a minimum set of encodings
and make it explicit that internally everything would be UTF-16 or
MacRoman. In MacRuby's case, I think most people would happily
accept that, just as a lot of JRuby users would probably accept that
everything's UTF-16 since that's what they get from Java normally.
This seems like a bad situation in the face of the varied encoding
landscape on the Internet.
Ultimately this is the exact reason I argued over a year ago that
Ruby 1.9 should introduce a separate Bytes class used for IO. I was
denied.
I was disappointed to see this turned down as well. The encoding
situation in 1.9 feels worse than it was in 1.8, and that’s pretty
impressive.
It's definitely a sticky issue, and Ruby has made it even stickier
in 1.9 with arbitrary encoding support. None of the proposed
solutions across all implementations (including JRuby) have really
seemed ideal to me.
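For what it's worth, the rejected proposal is easy to sketch in plain Ruby. Everything here (the class name, its methods) is hypothetical, but it shows the shape of an IO-only byte vector that never pretends to be text:

```ruby
# Hypothetical Bytes class -- this does not exist in Ruby 1.9; it is
# a sketch of the kind of type Charlie argued IO should return.
class Bytes
  def initialize(str = "")
    # Internally just binary-tagged bytes; no character semantics.
    @data = str.dup.force_encoding(Encoding::ASCII_8BIT)
  end

  def size
    @data.bytesize
  end

  def [](i)
    @data.getbyte(i)
  end

  # Decoding is the only way to get a character String back out, and
  # it fails loudly instead of silently carrying a bogus encoding tag.
  def decode(encoding)
    s = @data.dup.force_encoding(encoding)
    raise ArgumentError, "not valid #{encoding}" unless s.valid_encoding?
    s
  end
end

b = Bytes.new("caf\u00E9")            # 5 bytes of UTF-8 data
raise unless b.size == 5 && b[0] == 0x63
raise unless b.decode(Encoding::UTF_8) == "caf\u00E9"
```

With a type like this, IO would hand back raw bytes and the programmer would opt into character semantics explicitly, rather than every String carrying an encoding it may or may not honor.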
Laurent and I discussed this a bit tonight, and here’s what I think we
can get away with:
* By default, store all strings as NSString (UTF-16 backed) with an
ivar to store the encoding.
* When getting bytes, convert to a ByteString in the appropriate
encoding.
* When doing force_encoding, convert to a ByteString in the old
encoding, then try to convert to an NSString in the new encoding. If
we succeed, great. If not, leave it as a tagged ByteString (and
probably whine about it).
* All ASCII-8BIT strings are backed by ByteString.
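As a sanity check on that flow, here's how the retag-vs-transcode distinction behaves in plain Ruby 1.9 semantics, which is what the ByteString/NSString split has to reproduce (a sketch only; ByteString itself is MacRuby-side):

```ruby
s = "caf\u00E9"                          # UTF-8: 4 characters, 5 bytes
raise unless s.bytesize == 5

# The ByteString path: force_encoding retags the same bytes unchanged.
b = s.dup.force_encoding(Encoding::ASCII_8BIT)
raise unless b.bytesize == 5

# If the bytes aren't valid in the target encoding, the string stays a
# tagged-but-invalid blob -- the "whine about it" case above.
bad = s.dup.force_encoding(Encoding::US_ASCII)
raise if bad.valid_encoding?

# The NSString path: encode actually transcodes, and the bytes change.
u16 = s.encode(Encoding::UTF_16LE)
raise unless u16.bytesize == 8           # 4 characters * 2 bytes each
```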
There’s some simplification here; some of the ByteStrings are really
just NSDatas, &c., but the flow is there. Sorry the list above is a
mess; I’m up much later than I’m accustomed to.
-Ben
_______________________________________________
MacRuby-devel mailing list
[email protected]
http://lists.macosforge.org/mailman/listinfo.cgi/macruby-devel