> On 20 Nov 2014, at 2:39 , Jan Vrany <jan.vr...@fit.cvut.cz> wrote:
> 
> 
> But as I said, I'm more interested in 'low level' details like I
> mentioned: 
> 
> - encoding of the source string
>  
> 
> Best, Jan

IIRC, the .bin is the entire source string in DataStream format, that is, a 
datatype identifier (either ByteString or WideString) followed by the raw 
bytes/words (pure Latin-1 for ByteString, UTF-32BE(?) for WideString, at 
least since leadingChar of the standard Unicode locale was changed to 0). So 
writing an encoder/decoder strictly for use with MCZs isn't a very big task 
(this is what GemStone does).
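
To make the above concrete, here is a rough Python sketch of just the payload 
decoding step. It assumes the datatype identifier has already been read and 
mapped to a class name; parsing the actual DataStream header is not shown, and 
decode_source_payload is a made-up name, not anything from Monticello. The 
UTF-32BE part carries the same "(?)" as above.

```python
def decode_source_payload(kind: str, raw: bytes) -> str:
    """Decode the raw bytes/words that follow the datatype identifier.

    `kind` is assumed to be the class name the identifier maps to;
    reading the identifier itself is outside this sketch.
    """
    if kind == "ByteString":
        # One byte per character, interpreted as Latin-1.
        return raw.decode("latin-1")
    if kind == "WideString":
        # Assumption: 32-bit big-endian words holding plain code points
        # (leadingChar = 0), i.e. effectively UTF-32BE.
        if len(raw) % 4 != 0:
            raise ValueError("WideString payload is not a whole number of words")
        return raw.decode("utf-32-be")
    raise ValueError(f"unexpected string class: {kind!r}")
```

For example, decode_source_payload("ByteString", b"caf\xe9") gives 'café'.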

The pure text file (.sources) is only used as a fallback** when importing code 
where the .bin is corrupted or absent; it should be either pure Latin-1 or 
UTF-8*.

Cheers,
Henry

* Not sure if it ended up being solved using a BOM for UTF-8 (as in the 
.cs format), or if a UTF8Encoder is used by default, with a fallback to Latin-1 
if an invalid UTF-8 sequence is encountered. 
** Ironically, the string export was buggy until recently, causing lots of 
confusion when non-Latin-1 .mcz files exported/imported just fine in 
Squeak/Pharo but failed to import elsewhere (where .bin reading was not 
implemented).
