> On 20 Nov 2014, at 2:39, Jan Vrany <jan.vr...@fit.cvut.cz> wrote:
>
> But as I said, I'm more interested in 'low level' details like I
> mentioned:
>
> - encoding of the source string
>
> Best, Jan

IIRC, the .bin is the entire source string in DataStream format, that is, a datatype identifier (either ByteString or WideString) followed by the raw bytes/words. The content is pure Latin-1 for a ByteString and UTF32-BE(?) for a WideString, at least since the leadingChar of the standard Unicode locale was changed to 0.

So writing an encoder/decoder strictly for use with MCZs isn't a very big task (this is what GemStone does).

The pure text file (.sources) is only used as a fallback** when importing code where the .bin is corrupted or absent. It should be either pure Latin-1 or UTF-8*.

Cheers,
Henry

* Not sure whether this ended up being solved with a BOM marker for UTF-8 (as in the .cs format), or whether a UTF8Encoder is used by default, with a fallback to Latin-1 if an invalid UTF-8 sequence is encountered.

** Ironically, the string export was buggy until recently. This caused a lot of confusion when non-Latin-1 .mcz files exported and imported just fine in Squeak/Pharo but failed to import elsewhere (where reading the .bin was not implemented).
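
For illustration only, the two decoding rules described above might be sketched in Python roughly as follows. This is a minimal sketch under assumptions, not the actual Monticello or GemStone code: the function names are hypothetical, the DataStream header (the datatype identifier and any length field) is assumed to have been parsed already, and the text-file decoder simply combines both possibilities from the first footnote (accept a UTF-8 BOM if present, try UTF-8, fall back to Latin-1).

```python
import codecs


def decode_string_payload(payload: bytes, wide: bool) -> str:
    """Decode the raw string payload taken from the .bin.

    Assumes the caller has already read the datatype identifier and
    stripped any header; `wide` is True for a WideString payload.
    """
    if wide:
        # WideString: assumed to be 32-bit big-endian code points.
        return payload.decode("utf-32-be")
    # ByteString: one byte per character, Latin-1.
    return payload.decode("latin-1")


def decode_source_text(raw: bytes) -> str:
    """Decode the fallback text file: UTF-8 if valid, else Latin-1."""
    # Strip a UTF-8 BOM if one is present (hypothetical handling).
    body = raw[len(codecs.BOM_UTF8):] if raw.startswith(codecs.BOM_UTF8) else raw
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # Invalid UTF-8: treat the whole file as Latin-1 instead.
        return raw.decode("latin-1")
```

Note that the Latin-1 fallback never fails, since every byte value maps to a character, so a file in some other encoding would be decoded silently into the wrong text.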