To start with, it would be enough to support autodetection of plain old US-ASCII, UTF-8, and the system default encoding; those three are all I use, for instance. And, of course, it should be XML-friendly. Personally, I have some XML files in UTF-8 and some without any encoding declaration (so US-ASCII is assumed), and editing the UTF-8 ones is still a pain. Even if I had Win2000 (XP?) here with its mighty Notepad, it would insert its three bytes at the start of the file, which drive XML tools crazy...
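For what it's worth, those three bytes are the UTF-8 byte order mark (EF BB BF). Here is a minimal sketch of detecting and skipping it with plain java.io; the BomSniffer class name is just made up for illustration:

import java.io.*;

public class BomSniffer {

    // The three bytes Notepad prepends to UTF-8 files: EF BB BF.
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    // Returns a stream positioned just past the BOM if one is present,
    // or at the original start of the data if not.
    public static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int read = pin.read(head, 0, head.length);
        boolean hasBom = read == UTF8_BOM.length
                && head[0] == UTF8_BOM[0]
                && head[1] == UTF8_BOM[1]
                && head[2] == UTF8_BOM[2];
        if (!hasBom && read > 0) {
            pin.unread(head, 0, read); // not a BOM, so push the bytes back
        }
        return pin;
    }
}

And of course, seeing that BOM also settles the encoding question for the file: it can only be UTF-8.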
-----Original Message-----
From: "Maxim Shafirov" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Date: Wed, 17 Jul 2002 18:20:10 +0400
Subject: [Eap-features] Re: Improvement: make encoding detection intelligent, not based on assumptions

> It is not as easy, unfortunately. Characters over 127 do not imply
> ISO-8859-1 at all: another national encoding, such as KOI8-R (the
> Russian one), could be installed instead. There is physically no way to
> determine which single-byte encoding is used other than statistical
> analysis, and that seems unlikely to be accurate on .java files - too
> few national characters...
>
> --
>
> Best regards,
> Maxim Shafirov
> JetBrains, Inc / IntelliJ Software
> http://www.intellij.com
> "Develop with pleasure!"
>
>
> "Guillaume Laforge" <[EMAIL PROTECTED]> wrote in message
> news:ah39d2$t8l$[EMAIL PROTECTED]...
> >
> > "Maxim Shafirov" <[EMAIL PROTECTED]> wrote in message news:
> > ah1hl9$ju0$[EMAIL PROTECTED]
> > > Any ideas how to determine an encoding by file content?
> >
> > Hi,
> >
> > You need to read raw bytes and look for "tell-tale signs". I don't
> > know the details offhand (they're in the UTF RFCs), but I've got a
> > rough idea how to do this. It's not perfect: the strategy won't adapt
> > easily to charsets where every byte value is legal, but with UTF-8
> > it's easy enough ;-). Even without a marker byte at the start of the
> > stream, if you read up to a certain limit (4K? 8K? 16K even?), you
> > should come across bytes that are very obvious signs of which
> > encoding is in use. For example, if the file is in ISO-8859-1 or
> > similar, you'll probably encounter a number of bytes with values
> > between 128 and 255; in UTF-8, only a very specific subset of those
> > is possible (the valid patterns are described in the RFCs). If you
> > encounter only those patterns, you can assume UTF-8; if you encounter
> > other high-byte values, you can assume ISO-8859-1. If no such markers
> > are found at all, you can probably use the platform default.
> >
> > Code idea:
> >
> > // wrap the input stream
> > SmartEncodingByteReader br = new SmartEncodingByteReader(anInputStream);
> >
> > // read just enough bytes to guess the encoding
> > String encoding = br.readEncoding();
> >
> > // then read characters as normal
> > InputStreamReader reader = new InputStreamReader(br, encoding);
> >
> > In the readEncoding() method, the bytes would be buffered while
> > reading and guessing. When you start reading from the character
> > reader, the smart byte reader first serves the bytes from its
> > internal buffer, then continues reading the rest of the data from the
> > stream. The size limit keeps it from building a huge buffer for very
> > large files.
> >
> > Hope this helps; thanks for your interest,
> > Guillaume
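To make Guillaume's byte-pattern check concrete, here is a minimal sketch of the heuristic in plain Java. The class and method names are made up for illustration, and a real detector would also want to recognize BOMs and cap how many bytes it inspects:

import java.io.*;

public class EncodingGuesser {

    // Guess an encoding from a buffer of raw bytes (a hypothetical
    // helper, not a real API). Every valid UTF-8 multi-byte sequence
    // has a lead byte announcing its length, followed only by
    // 10xxxxxx continuation bytes.
    public static String guessEncoding(byte[] buf, int len) {
        boolean sawHighByte = false;
        int i = 0;
        while (i < len) {
            int b = buf[i] & 0xFF;
            if (b < 0x80) { i++; continue; }           // plain ASCII byte
            sawHighByte = true;
            int trailing;
            if ((b & 0xE0) == 0xC0) trailing = 1;      // 110xxxxx
            else if ((b & 0xF0) == 0xE0) trailing = 2; // 1110xxxx
            else if ((b & 0xF8) == 0xF0) trailing = 3; // 11110xxx
            else return "ISO-8859-1";                  // impossible UTF-8 lead byte
            if (i + trailing >= len) break;            // sequence cut off by buffer end
            for (int j = 1; j <= trailing; j++) {
                if ((buf[i + j] & 0xC0) != 0x80)       // must be 10xxxxxx
                    return "ISO-8859-1";
            }
            i += trailing + 1;
        }
        return sawHighByte ? "UTF-8" : "US-ASCII";
    }

    public static void main(String[] args) throws IOException {
        // usage: read up to 8K from a file and guess
        FileInputStream in = new FileInputStream(args[0]);
        byte[] buf = new byte[8192];
        int len = in.read(buf);
        in.close();
        System.out.println(guessEncoding(buf, Math.max(len, 0)));
    }
}

Random ISO-8859-1 or KOI8-R text almost never satisfies the continuation-byte rule by accident, which is why the check works even without a BOM; the ambiguity Maxim points out only remains between the single-byte encodings themselves.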
