"Maxim Shafirov" <[EMAIL PROTECTED]> wrote in the news message
ah1hl9$ju0$[EMAIL PROTECTED]
> Any ideas how to determine an encoding by file content?
>
> --
>

Hi

You need to read raw bytes and look for some "tell-tale signs". I don't know
the details offhand (they're in the UTF-8 RFC, RFC 3629), but I've got a
rough idea how to do this. It isn't perfect: the strategy is hard to adapt
to charsets where every byte value is legal, but with UTF-8 it's easy enough
;-). Even without a marker byte (BOM) at the start of the stream, if you
read up to a certain limit (4K? 8K? 16K even?), you should come across bytes
that strongly suggest which encoding is in use. For example, if the file is
in ISO-8859-1 or similar, you'll probably encounter a number of bytes with
values between 128-255; in UTF-8, only very specific sequences of such bytes
are possible (lead bytes followed by continuation bytes, as described in the
RFC). If you only encounter valid UTF-8 sequences, you can assume UTF-8;
if you encounter other high-byte patterns, you can assume ISO-8859-1. If no
high bytes are found at all, you can probably use the platform default.
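As a rough sketch of that byte-pattern check (the class and method names here are my own invention, and the scan limit is an arbitrary choice):

```java
import java.nio.charset.Charset;

public class EncodingGuesser {

    /** Scan up to the first 8K of bytes and guess an encoding name. */
    public static String guessEncoding(byte[] bytes) {
        boolean sawHighBytes = false;
        int limit = Math.min(bytes.length, 8192); // cap the scan for huge files
        int i = 0;
        while (i < limit) {
            int b = bytes[i] & 0xFF;
            if (b < 0x80) { i++; continue; }      // plain ASCII tells us nothing
            sawHighBytes = true;
            // Work out how many continuation bytes the UTF-8 lead byte demands
            int cont;
            if ((b & 0xE0) == 0xC0)      cont = 1;
            else if ((b & 0xF0) == 0xE0) cont = 2;
            else if ((b & 0xF8) == 0xF0) cont = 3;
            else return "ISO-8859-1";             // not a valid UTF-8 lead byte
            if (i + cont >= limit) break;         // sequence runs past the scan
                                                  // limit; give UTF-8 the
                                                  // benefit of the doubt
            for (int j = 1; j <= cont; j++) {
                if ((bytes[i + j] & 0xC0) != 0x80) {
                    return "ISO-8859-1";          // missing continuation byte
                }
            }
            i += cont + 1;
        }
        if (sawHighBytes) return "UTF-8";
        return Charset.defaultCharset().name();   // pure ASCII: platform default
    }
}
```

This leans on the fact that in UTF-8 every byte above 127 is either a lead byte (110xxxxx, 1110xxxx, 11110xxx) or a continuation byte (10xxxxxx), so random ISO-8859-1 accented text is very unlikely to form valid sequences by accident.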

Code idea:

// wrap the input stream
SmartEncodingByteReader br = new SmartEncodingByteReader(anInputStream);

// read some bytes, just until the encoding can be guessed
String encoding = br.readEncoding();

// read as normal
InputStreamReader reader = new InputStreamReader(br, encoding);

In the "readEncoding" method, the bytes would be buffered while being read
and analysed. When you then start reading from the character reader, the
smart byte reader first serves bytes from its internal buffer, then
continues reading from the underlying stream for the remainder of the data.
The size limit prevents building a huge buffer for very large files.
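One way to get that "buffer first, then replay" behaviour without writing a custom stream class is java.io.BufferedInputStream with mark/reset. A minimal sketch (the class name, the open method and the trivial guessEncoding stub are all mine; plug in whatever detection logic you prefer):

```java
import java.io.*;
import java.nio.charset.Charset;

public class SmartDecode {

    // Placeholder detection stub: any high byte => assume UTF-8,
    // otherwise fall back to the platform default.
    static String guessEncoding(byte[] head, int len) {
        for (int i = 0; i < len; i++) {
            if ((head[i] & 0xFF) >= 0x80) return "UTF-8";
        }
        return Charset.defaultCharset().name();
    }

    public static Reader open(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw, 8192);
        in.mark(8192);                         // remember the stream start
        byte[] head = new byte[8192];
        int n = 0, r;
        while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
            n += r;                            // peek at up to 8K of bytes
        }
        String encoding = guessEncoding(head, n);
        in.reset();                            // rewind: the Reader re-reads
                                               // the buffered bytes first
        return new InputStreamReader(in, encoding);
    }
}
```

mark(8192) tells the buffered stream to retain up to 8K so reset() can rewind to the start; after that, reads fall through to the underlying stream as normal, which is exactly the behaviour described above.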

Hope this helps, thanks for your interest,
Guillaume


_______________________________________________
Eap-features mailing list
[EMAIL PROTECTED]
http://lists.jetbrains.com/mailman/listinfo/eap-features
