Re: regarding Latin1 to UTF8 encoding

Hugo Florentino Sun, 08 Dec 2013 19:34:46 -0800

On Mon, 09 Dec 2013 04:19:51 +0100, Adam D. Ruppe wrote:

On Monday, 9 December 2013 at 03:07:58 UTC, Hugo Florentino wrote:

Is there a way to detect the encoding prior to typecasting/loading the file?


UTF-8 can be detected fairly reliably, but not much luck for other
encodings. A Windows-1258 and a Latin1 file, for example, are usually
fairly indistinguishable from a binary perspective - they use the same
numbers, just for different things.

(It is possible to distinguish them if you use some context and
grammar check kind of things, but that's not easy.)


But utf-8 has a neat feature: any non-ascii stuff needs to validate,
and it is unlikely that random data would correctly validate.

std.utf.validate can do that (though it throws an exception if it
fails, ugh!)
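
As a minimal sketch of that check, the exception can be hidden behind a small helper that returns a bool. The name looksLikeUtf8 is made up for illustration and is not part of Phobos; only std.utf.validate and UTFException come from the library.

```d
import std.utf : validate, UTFException;

/// Hypothetical helper (not in Phobos): true if `data` is well-formed UTF-8.
bool looksLikeUtf8(const(ubyte)[] data)
{
    try
    {
        // validate throws UTFException on the first malformed sequence
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

Pure ASCII input always passes, so a true result only means "nothing contradicts UTF-8", not proof of the encoding.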

So here's how I did it in my own characterencodings.d:


https://github.com/adamdruppe/misc-stuff-including-D-programming-language-web-stuff/blob/master/characterencodings.d#L138


        string utf8string;
        import std.utf : validate, UTFException;
        import std.encoding : Latin1String, transcode;
        try {
                // throws UTFException if rawData is not well-formed UTF-8
                validate!string(cast(string) rawData);
                // validation passed, assume it is UTF-8 and use it
                utf8string = cast(string) rawData;
        } catch(UTFException t) {
                // not utf-8, try latin1
                transcode(cast(Latin1String) rawData, utf8string);
        }

        // now go ahead and use utf8 string, it should be set


Clever solution, thanks.
Could this work using scope instead of try/catch?

P.S. Nice unit, by the way.
