One thing I have done in the past that was along similar lines:

If you know that it is a UTF, and if you know that you support the latest
version of Unicode, then you can walk through the bytes in 7 parallel paths,
each fetching and testing a code point in one of the 7 encoding schemes. If
you hit an illegal sequence or an unassigned code point, then you 'turn off'
that path. If at any point you are down to a single path, then jump to a
faster routine to do the rest of the conversion. (I actually had 8 paths,
since I could also have Latin-1.)

I never put in anything to settle the cases where you end up with more than
one path, beyond a simple priority order. In those rare cases, I suspect
something simple like capturing the frequency of some common characters
(newlines, spaces, certain punctuation) and some uncommon ones (most
controls) would go a long way.
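
In rough Python, the skeleton of those two steps looks something like
this (an illustrative sketch: the path set, chunk size, and codec names
are arbitrary, and a real version would also test for unassigned code
points):

    import codecs

    # Candidate paths: the UTF encoding schemes plus Latin-1.
    CANDIDATES = ("utf-8", "utf-16-le", "utf-16-be",
                  "utf-32-le", "utf-32-be", "latin-1")

    def detect_paths(data: bytes, chunk: int = 256) -> list:
        """Walk the bytes down parallel paths, turning a path off
        at its first illegal sequence."""
        paths = {name: codecs.getincrementaldecoder(name)()
                 for name in CANDIDATES}
        for i in range(0, len(data), chunk):
            block = data[i:i + chunk]
            for name in list(paths):
                try:
                    paths[name].decode(block)
                except UnicodeDecodeError:
                    del paths[name]  # illegal sequence: turn path off
            if len(paths) == 1:
                return list(paths)  # jump to the fast single-path routine
        # End of input: flush to catch truncated trailing sequences.
        for name in list(paths):
            try:
                paths[name].decode(b"", final=True)
            except UnicodeDecodeError:
                del paths[name]
        return list(paths)

Since Latin-1 never fails, ties are the normal outcome, which is where
the priority order or the character-frequency scoring above comes in.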

Mark
__________________________________
http://www.macchiato.com

----- Original Message ----- 
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Brijesh Sharma" <[EMAIL PROTECTED]>
Sent: Sun, 2004 Jan 11 21:48
Subject: Re: Detecting encoding in Plain text


> Brijesh Sharma <bssharma at quark dot co dot in> wrote:
>
> > I'm writing a small tool to get text from a txt file into an edit box.
> > Now this txt file could be in any encoding, e.g. UTF-8, UTF-16, Mac
> > Roman, Windows ANSI, Western (ISO 8859-1), JIS, or Shift-JIS.
> > My problem is that I can distinguish between UTF-8 and UTF-16 using
> > the BOM.
> > But how do I auto-detect the others?
> > Any kind of help will be appreciated.
>
> This has always been an interesting topic to me, even before the Unicode
> era.  The best information I have ever seen on this topic is Li and
> Momoi's paper.  To reiterate the URL:
>
> http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>
> If you are "writing a small tool," however, you may not have the space
> or time to implement everything Li and Momoi described.
>
> You probably need to divide the problem into (1) detection of Unicode
> encodings and (2) detection of non-Unicode encodings, because these are
> really different problems.
>
> Detecting Unicode encodings, of course, is trivial if the stream begins
> with a signature (BOM, U+FEFF), but as Jon Hanna pointed out, you can't
> always count on the signature being present.  You need to rely primarily
> on what Li and Momoi call the "coding scheme method," searching for
> valid (and invalid) sequences in the various encoding schemes.  This
> works well for UTF-8 in particular; most non-contrived text that
> contains at least one valid multibyte UTF-8 sequence and no invalid
> UTF-8 sequences is very likely to be UTF-8.
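>
> A compact form of that check in Python (an illustrative sketch, not
> from Li and Momoi's paper):
>
>     def looks_like_utf8(data: bytes) -> bool:
>         """Coding scheme method for UTF-8: every sequence must be
>         valid, and at least one multibyte sequence should appear."""
>         try:
>             data.decode("utf-8")
>         except UnicodeDecodeError:
>             return False  # invalid sequence: not UTF-8
>         # Pure ASCII also decodes as UTF-8 but offers no evidence
>         # either way, so require at least one non-ASCII byte.
>         return any(b >= 0x80 for b in data)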
>
> In UTF-16 practically any sequence of bytes is valid, and since you
> can't assume you know the language, you can't employ distribution
> statistics.  Twelve years ago, when most text was not Unicode and all
> Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> of checking every other byte to see if it was zero, which of course
> would only work for Latin-1 text encoded in UTF-16.  If you need to
> detect the encoding of non-Western-European text, you would have to be
> more sophisticated than this.
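>
> That old heuristic amounts to a few lines (a sketch under the same
> Latin-1 assumption):
>
>     def looks_like_utf16le_latin1(data: bytes) -> bool:
>         """Check whether every other byte is zero; as noted above,
>         this works only for Latin-1 text encoded in UTF-16."""
>         high_bytes = data[1::2]
>         return len(high_bytes) > 0 and all(b == 0 for b in high_bytes)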
>
> Here are some notes I've taken on detecting a byte stream known to be in
> a Unicode encoding scheme (UTF-8, UTF-16, UTF-32, or SCSU).  This is a
> work in progress and is not expected to be complete or perfect, so feel
> free to send corrections and enhancements but not flames:
>
> 0A 00
> • inverse of U+000A LINE FEED
> • U+0A00 = unassigned Gurmukhi code point
> • may indicate little-endian UTF-16
>
> 0A 0D
> • 8-bit line feed + carriage return
> • U+0A0D = unassigned Gurmukhi code point
> • probably indicates 8-bit encoding
>
> 0D 00
> • inverse of U+000D CARRIAGE RETURN
> • U+0D00 = unassigned Malayalam code point
> • may indicate little-endian UTF-16
>
> 0D 0A
> • 8-bit carriage return + line feed
> • U+0D0A = MALAYALAM LETTER UU
>   • text should include other Malayalam characters (U+0D00–U+0D7F)
> • otherwise, probably indicates 8-bit encoding
>
> 20 00
> • inverse of U+0020 SPACE
> • U+2000 = EN QUAD (infrequent character)
> • may indicate UTF-16 (probably little-endian)
>
> 28 20
> • inverse of U+2028 LINE SEPARATOR
> • U+2820 = BRAILLE PATTERN DOTS-6
>   • text should include other Braille characters (U+2800–U+28FF)
> • may indicate little-endian UTF-16
> • but may also indicate 8-bit 20 28 or 28 20 (space + left parenthesis)
>
> E2 80 A8
> • UTF-8 representation of U+2028 LINE SEPARATOR
> • probably indicates UTF-8
>
> 05 28
> • SCSU representation of U+2028 LINE SEPARATOR
> • U+0528 is unassigned
> • U+2805 is BRAILLE PATTERN DOTS-13
>   • should be surrounded by other Braille characters
> • otherwise, probably indicates SCSU
>
> 00 00 00
> • probably a Basic Latin character in UTF-32 (either byte order)
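>
> Those notes can be mechanized as a simple vote count (an illustrative
> sketch; the pattern subset and one-vote-per-hit weighting are
> arbitrary):
>
>     # Each byte pattern votes for the encodings it suggests.
>     PATTERNS = {
>         b"\x0a\x00": ("utf-16-le",),     # inverse of LINE FEED
>         b"\x0d\x00": ("utf-16-le",),     # inverse of CARRIAGE RETURN
>         b"\x0d\x0a": ("8-bit",),         # CR + LF
>         b"\x20\x00": ("utf-16-le",),     # inverse of SPACE
>         b"\xe2\x80\xa8": ("utf-8",),     # U+2028 in UTF-8
>         b"\x05\x28": ("scsu",),          # SCSU-quoted U+2028
>     }
>
>     def vote(data: bytes) -> dict:
>         tally = {}
>         for pattern, encodings in PATTERNS.items():
>             hits = data.count(pattern)
>             for enc in encodings:
>                 tally[enc] = tally.get(enc, 0) + hits
>         return tally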
>
> Detecting non-Unicode encodings is quite another matter, and here you
> really need to study the techniques described by Li and Momoi.
> Distinguishing straight ASCII from ISO 8859-1 from Windows-1252 is
> easy -- just check which subsets of Windows-1252 are present -- but
> throwing Mac Roman and East Asian double-byte sets into the mix makes
> the problem much harder.
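>
> The easy three-way case can be made concrete (an illustrative sketch,
> assuming the text really is one of these three):
>
>     def ascii_or_8859_1_or_1252(data: bytes) -> str:
>         """Bytes 0x80-0x9F are C1 controls in ISO 8859-1 (rare in
>         real text) but mostly printable characters in Windows-1252."""
>         if all(b < 0x80 for b in data):
>             return "US-ASCII"
>         if any(0x80 <= b <= 0x9F for b in data):
>             return "windows-1252"
>         return "ISO-8859-1"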
>
> I once wrote a program to detect the encoding of a text sample known to
> be in one of the following Cyrillic encodings:
>
> • KOI8-R
> • Windows code page 1251
> • ISO 8859-5
> • MS-DOS code page 866
> • MS-DOS code page 855
> • Mac Cyrillic
>
> Given the Unicode scalar values corresponding to each byte value, the
> program calculates the proportion of Cyrillic characters (as opposed to
> punctuation and dingbats) when interpreted in each possible encoding,
> and picks the encoding with the highest proportion (confidence level).
> This is a dumbed-down version of Li and Momoi's character distribution
> method, but works surprisingly well so long as the text really is in one
> of these Cyrillic encodings.  It fails spectacularly for text in
> Latin-1, Mac Roman, UTF-8, etc.  It would probably also be unable to
> detect differences between almost-identical character sets, like KOI8-R
> and KOI8-U.
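>
> In rough Python, that confidence calculation looks something like
> this (an illustrative sketch, not the original program):
>
>     CANDIDATES = ("koi8-r", "cp1251", "iso8859-5",
>                   "cp866", "cp855", "mac-cyrillic")
>
>     def detect_cyrillic(data: bytes) -> str:
>         """Pick the encoding whose interpretation yields the
>         highest proportion of Cyrillic letters."""
>         def score(name: str) -> float:
>             # ASCII bytes are identical in all six encodings,
>             # so only bytes >= 0x80 carry any signal.
>             high = bytes(b for b in data if b >= 0x80)
>             if not high:
>                 return 0.0
>             text = high.decode(name, errors="replace")
>             cyr = sum(1 for c in text if "\u0400" <= c <= "\u04ff")
>             return cyr / len(text)
>         return max(CANDIDATES, key=score)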
>
> The smaller your list of "possible" encodings, the easier your job of
> detecting one of them.
>
> -Doug Ewell
>  Fullerton, California
>  http://users.adelphia.net/~dewell/
>
>
>

