"Maxim Shafirov" <[EMAIL PROTECTED]> wrote in message news:ah1hl9$ju0$[EMAIL PROTECTED]
> Any ideas how to determine an encoding by file content?
>
> --
>
Hi,

You need to read the raw bytes and look for some "tell-tale signs". I don't know the details offhand (they're in the UTF RFCs), but I have a rough idea of how to do this. It isn't perfect: this strategy won't adapt easily to charsets in which every byte value is legal, but with UTF-8 it's easy enough ;-).

Even without a "marker byte" (a byte-order mark) at the start of the stream, if you read up to a certain limit (4K? 8K? 16K even?) you should come across bytes that are very obvious signs of which encoding is in use. For example, if the file is in ISO-8859-1 or similar, you'll probably encounter a number of bytes with values between 128 and 255 in arbitrary positions; in UTF-8, only a very specific subset of high-byte sequences is valid (the allowed bit patterns are described in the RFCs). If you only encounter valid UTF-8 sequences, you can assume UTF-8; if you encounter other high-byte patterns, you can assume ISO-8859-1. If no high bytes are found at all, you can probably fall back to the platform default.

Code idea:

    // wrap the input stream
    SmartEncodingByteReader br = new SmartEncodingByteReader(anInputStream);
    // read some bytes, just until the encoding can be guessed
    String encoding = br.readEncoding();
    // read as normal
    InputStreamReader reader = new InputStreamReader(br, encoding);

In the readEncoding() method, the bytes would be buffered while reading and guessing. When you then start reading from the character reader, the smart byte reader first serves the bytes from its internal buffer, and only afterwards continues reading from the underlying stream for the remainder of the data. The size limit prevents building a huge buffer for very large files.

Hope this helps, and thanks for your interest,

Guillaume

_______________________________________________
Eap-features mailing list
[EMAIL PROTECTED]
http://lists.jetbrains.com/mailman/listinfo/eap-features
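The heuristic above can be sketched in Java. This is only a minimal sketch of the idea, not anyone's actual implementation: the class name EncodingSniffer and method guessEncoding are made up for illustration, and java.io.PushbackInputStream stands in for the hypothetical SmartEncodingByteReader (its unread() call plays the role of "serve the buffered bytes first").

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class EncodingSniffer {

    /** Guess a charset from a sample of raw bytes (illustrative heuristic only). */
    public static String guessEncoding(byte[] sample, int len) {
        boolean sawHighByte = false;
        int i = 0;
        while (i < len) {
            int b = sample[i] & 0xFF;
            if (b < 0x80) { i++; continue; }            // plain ASCII carries no information
            sawHighByte = true;
            int trailing;
            if ((b & 0xE0) == 0xC0) trailing = 1;       // 110xxxxx: 2-byte sequence
            else if ((b & 0xF0) == 0xE0) trailing = 2;  // 1110xxxx: 3-byte sequence
            else if ((b & 0xF8) == 0xF0) trailing = 3;  // 11110xxx: 4-byte sequence
            else return "ISO-8859-1";                   // not a valid UTF-8 lead byte
            if (i + trailing >= len) break;             // sequence cut off at the buffer end
            for (int j = 1; j <= trailing; j++) {
                if ((sample[i + j] & 0xC0) != 0x80) {   // continuations must be 10xxxxxx
                    return "ISO-8859-1";
                }
            }
            i += trailing + 1;
        }
        if (!sawHighByte) {
            return System.getProperty("file.encoding"); // pure ASCII: platform default
        }
        return "UTF-8";
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guessEncoding(utf8, utf8.length));     // prints UTF-8
        System.out.println(guessEncoding(latin1, latin1.length)); // prints ISO-8859-1

        // The "smart reader" idea, using PushbackInputStream in place of the
        // hypothetical SmartEncodingByteReader: sample up to 8K, guess the
        // encoding, then push the sampled bytes back so the character reader
        // still sees the whole stream from the beginning.
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(utf8), 8192);
        byte[] buf = new byte[8192];
        int n = in.read(buf);
        String encoding = guessEncoding(buf, n);
        in.unread(buf, 0, n);
        Reader reader = new InputStreamReader(in, encoding);
        System.out.println((char) reader.read());                 // prints h
    }
}
```

Note the heuristic deliberately errs toward ISO-8859-1: a single byte pattern that cannot occur in well-formed UTF-8 is enough to rule UTF-8 out, while a sample that parses cleanly as UTF-8 multi-byte sequences is very unlikely to be Latin-1 text by accident.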
