It would be great to be able to sanitize the files. We should get a test about tonel misbehavior.
Stef

On Thu, Dec 14, 2017 at 3:11 PM, Henrik Sperre Johansen
<[email protected]> wrote:
> Stephan Eggermont-3 wrote
>> On 05-12-17 08:59, Peter Uhnák wrote:
>>> > In my case, it turned out to be a non-UTF8 encoded character in one
>>> of the commit messages.
>>>
>>> I've run into this problem in a sister project (tonel-migration), and do
>>> not have a proper resolution yet. I was forcing everything to be
>>> unicode, so I need a better way to read and write encoded strings. :<
>>
>> To be exact, exactly none of the older commits will be UTF8 encoded. For
>> most it doesn't matter as they are ASCII, but if we want to have a
>> chance of converting older french or german code (or japanese), we need
>> support for what was done with WideString. That probably needs a look in
>> the squeak mailing list archives.
>>
>> Stephan
>
> The mcz reader used to import the .bin file (which contained correctly
> serialized WideStrings), only falling back to reading the .st file if .bin
> was not present, has this changed?
>
> Or do these tools explicitly ignore the .bin file and try to read the .st
> file directly?
> If so, the MCDataStream class used to read .bin format still seems to be in
> the image...
>
> One could also create a tool to check/convert all mcz in a repo as a
> preprocess;
> if .bin contents decode as WideString,
> check that .st starts with utf8 BOM,
> if not, convert.
>
> Cheers,
> Henry
>
> --
> Sent from: http://forum.world.st/Pharo-Smalltalk-Developers-f1294837.html
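
For what it's worth, the "check" half of the preprocess Henry describes could start out as something like the rough sketch below. It is Python rather than Smalltalk, so it can run outside the image, and it rests on several assumptions: that an mcz is a zip archive, that the .st file sits at `snapshot/source.st` inside it, and that looking at the raw bytes of the .st file is a good enough first pass. It does not read the .bin file at all (that would need MCDataStream or an equivalent), so it cannot tell whether the .bin really holds WideStrings. The names `classify_source`, `check_mcz` and `check_mcz_bytes` are made up for illustration.

```python
import io
import zipfile

UTF8_BOM = b"\xef\xbb\xbf"


def classify_source(data: bytes) -> str:
    """Classify the raw bytes of a .st file.

    Returns one of: 'utf8-bom', 'ascii', 'utf8', 'needs-conversion'.
    Only the BOM prefix is checked for 'utf8-bom'; the rest of the
    content is not validated in that case.
    """
    if data.startswith(UTF8_BOM):
        return "utf8-bom"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf8"
    except UnicodeDecodeError:
        # Neither ASCII nor valid UTF-8: probably a legacy encoding.
        return "needs-conversion"


def check_mcz(path_or_file, member="snapshot/source.st") -> str:
    """Classify the .st member of one mcz archive.

    The default member name is an assumption about the archive layout.
    Returns 'missing' if the member is not present.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        if member not in archive.namelist():
            return "missing"
        return classify_source(archive.read(member))


def check_mcz_bytes(raw: bytes, member="snapshot/source.st") -> str:
    """Same as check_mcz, for an archive already held in memory."""
    return check_mcz(io.BytesIO(raw), member)
```

Running `check_mcz` over every mcz in a repository and listing the ones that come back as `needs-conversion` would at least show how many files are affected before anyone writes the conversion step.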
