[Eap-features] Re: Improvement: make encoding detection intelligent, not based on assumptions

Maxim Shafirov Wed, 17 Jul 2002 07:05:24 -0700

Unfortunately, it is not that easy. Characters over 127 do not imply
ISO-8859-1 at all: some other national encoding could be in use, such as
KOI8-R (the Russian one). There is no way to determine which single-byte
encoding is used other than statistical analysis, and that is unlikely to
be accurate on .java files, which contain too few national characters...
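
To make the point about high bytes concrete, here is a minimal sketch
showing one and the same byte decoding to different characters under
ISO-8859-1 and KOI8-R. The class and method names are hypothetical, and it
assumes the running JVM ships the KOI8-R charset.

```java
import java.nio.charset.Charset;

class HighByteAmbiguity {
    // Decodes the same raw bytes under the named charset.
    // Assumption: the charset is available in this JVM; otherwise
    // Charset.forName throws UnsupportedCharsetException.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xC1 }; // a single byte above 127

        // The byte itself carries no hint about which reading is intended.
        for (String name : new String[] { "ISO-8859-1", "KOI8-R" }) {
            int codePoint = decode(raw, name).codePointAt(0);
            System.out.printf("%s -> U+%04X%n", name, codePoint);
        }
    }
}
```

Both decodings succeed without error, which is why the byte values alone
cannot tell two single-byte encodings apart.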


--

Best regards,
Maxim Shafirov
JetBrains, Inc / IntelliJ Software
http://www.intellij.com
"Develop with pleasure!"


"Guillaume Laforge" <[EMAIL PROTECTED]> wrote in message
news:ah39d2$t8l$[EMAIL PROTECTED]...
>
> "Maxim Shafirov" <[EMAIL PROTECTED]> wrote in message news:
> ah1hl9$ju0$[EMAIL PROTECTED]
> > Any ideas how to determine an encoding by file content?
> >
> > --
> >
>
> Hi
>
> You need to read raw bytes for some "tell-tale signs": I don't know the
> details offhand (but they're available in the RFCs for UTF), but I've got
> a rough idea how to do this.  It's not perfect: this strategy won't be
> easy to adapt to charsets where all byte values can be used, but with
> UTF-8 it's easy enough ;-).  Even without a "marker byte" at the start of
> the stream, if you read up to a certain limit (4K? 8K? 16K even?), you
> should come across characters that are very obvious signs as to which
> encoding should be used.  For example, if the file's in ISO-8859-1 or
> similar, you'll probably encounter a number of characters with byte
> values between 128 and 255; with UTF-8, only a very specific subset of
> byte sequences is possible (the possible flags are described in the
> RFCs).  If you only encounter these flags, you can assume UTF-8,
> otherwise if you encounter other high-byte values, you can assume
> ISO-8859-1.  If no such markers are found, you can probably use the
> platform default.
>
> Code idea:
>
> // wrap the input stream
> SmartEncodingByteReader br = new SmartEncodingByteReader(anInputStream);
>
> // read some bytes, just until the encoding can be guessed
> String encoding = br.readEncoding();
>
> // read as normal
> InputStreamReader reader = new InputStreamReader(br, encoding);
>
> In the "readEncoding" method, the bytes would be buffered whilst reading
> and trying to guess.  When you start reading from the character reader,
> the smart byte reader first provides bytes from its internal buffer, then
> continues reading from the stream for the remainder of the data.  The
> size limit avoids creating a huge buffer for very large files.
>
> Hope this helps, thanks for your interest,
> Guillaume
>
>
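
The guessing rule in the quoted message (only valid UTF-8 sequences means
UTF-8, other high bytes mean ISO-8859-1, no high bytes means the default)
could be sketched roughly as below. This is only an illustration: the
class name, method name, and `fallback` parameter are hypothetical, and it
leans on the JDK's strict UTF-8 decoder rather than hand-checking the bit
patterns from the RFC. As noted in the reply above, the ISO-8859-1 branch
is really just "some single-byte encoding".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingGuess {
    // Returns a guessed charset name for the sampled bytes.
    // 'fallback' stands in for the platform default (an assumption).
    static String guess(byte[] sample, String fallback) {
        boolean highByteSeen = false;
        for (byte b : sample) {
            if (b < 0) { // Java bytes are signed: negative means 128..255
                highByteSeen = true;
                break;
            }
        }
        if (!highByteSeen) {
            return fallback; // pure ASCII gives no evidence either way
        }
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(sample));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "ISO-8859-1"; // high bytes that are not valid UTF-8
        }
    }
}
```

One caveat: if the sample is cut off in the middle of a multi-byte
sequence, this strict check treats it as malformed, so a real
implementation would need to tolerate a truncated tail.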

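The buffering behaviour described for the "SmartEncodingByteReader" in the
quoted message can be approximated with the standard mark/reset support of
`BufferedInputStream`. The sketch below is an assumption about how such a
reader might work, not the poster's actual class: `SniffingReader`,
`SNIFF_LIMIT`, and the `guesser` callback are all hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.function.Function;

class SniffingReader {
    static final int SNIFF_LIMIT = 8192; // cap on bytes inspected (assumed)

    // Peeks at up to SNIFF_LIMIT bytes, asks 'guesser' for a charset name,
    // then rewinds so the returned Reader starts at the first byte.
    static Reader open(InputStream in, Function<byte[], String> guesser)
            throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in, SNIFF_LIMIT);
        buffered.mark(SNIFF_LIMIT);
        byte[] sample = new byte[SNIFF_LIMIT];
        int n = 0;
        while (n < SNIFF_LIMIT) {
            int r = buffered.read(sample, n, SNIFF_LIMIT - n);
            if (r < 0) break; // end of stream before the limit
            n += r;
        }
        buffered.reset(); // replay the sniffed bytes to the reader
        String encoding = guesser.apply(Arrays.copyOf(sample, n));
        return new InputStreamReader(buffered, encoding);
    }

    // Convenience helper for small inputs: decodes the whole stream.
    static String readAll(InputStream in, Function<byte[], String> guesser) {
        try (Reader reader = open(in, guesser)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the mark limit matches the sniff limit, memory use stays bounded
however large the file is; only the first few kilobytes are ever held for
replay.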

_______________________________________________
Eap-features mailing list
[EMAIL PROTECTED]
http://lists.jetbrains.com/mailman/listinfo/eap-features
