To start with, it would be enough to support autodetection of plain old US-ASCII, UTF-8, and the system default encoding; those three are all I use, for instance. And, of course, it should be XML-friendly. Personally, I have some XML files in UTF-8 and some without any encoding declaration (so US-ASCII is assumed), and editing the UTF-8 ones is still a pain. Even if I had Win2000 (XP?) here with its mighty Notepad, it would insert its three bytes at the start of the file, which drive XML tools crazy...
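For what it's worth, those three bytes are the UTF-8 byte order mark (EF BB BF). Here is a minimal sketch of detecting and skipping it with plain java.io; the BomSniffer class name is just made up for illustration:

import java.io.*;

public class BomSniffer {

    // The three bytes Notepad prepends to UTF-8 files: EF BB BF.
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    // Returns a stream positioned just past the BOM if one is present,
    // or at the original start of the data if not.
    public static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        int read = pin.read(head, 0, head.length);
        boolean hasBom = read == UTF8_BOM.length
                && head[0] == UTF8_BOM[0]
                && head[1] == UTF8_BOM[1]
                && head[2] == UTF8_BOM[2];
        if (!hasBom && read > 0) {
            pin.unread(head, 0, read); // not a BOM, so push the bytes back
        }
        return pin;
    }
}

And of course, seeing that BOM also settles the encoding question for the file: it can only be UTF-8.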
-----Original Message-----
From: "Maxim Shafirov" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Date: Wed, 17 Jul 2002 18:20:10 +0400
Subject: [Eap-features] Re: Improvement: make encoding detection intelligent, not based on assumptions

> It is not as easy, unfortunately. Characters over 127 do not imply
> ISO-8859-1 at all: another national encoding, such as KOI8-R (the
> Russian one), could be installed instead. There is physically no way to
> determine which single-byte encoding is used other than statistical
> analysis, and that seems unlikely to be accurate on .java files - too
> few national characters...
>
> --
>
> Best regards,
> Maxim Shafirov
> JetBrains, Inc / IntelliJ Software
> http://www.intellij.com
> "Develop with pleasure!"
>
>
> "Guillaume Laforge" <[EMAIL PROTECTED]> wrote in message
> news:ah39d2$t8l$[EMAIL PROTECTED]...
> >
> > "Maxim Shafirov" <[EMAIL PROTECTED]> wrote in message news:
> > ah1hl9$ju0$[EMAIL PROTECTED]
> > > Any ideas how to determine an encoding by file content?
> >
> > Hi,
> >
> > You need to read raw bytes and look for "tell-tale signs". I don't
> > know the details offhand (they're in the UTF RFCs), but I've got a
> > rough idea how to do this. It's not perfect: the strategy won't adapt
> > easily to charsets where every byte value is legal, but with UTF-8
> > it's easy enough ;-). Even without a marker byte at the start of the
> > stream, if you read up to a certain limit (4K? 8K? 16K even?), you
> > should come across bytes that are very obvious signs of which
> > encoding is in use. For example, if the file is in ISO-8859-1 or
> > similar, you'll probably encounter a number of bytes with values
> > between 128 and 255; in UTF-8, only a very specific subset of those
> > is possible (the valid patterns are described in the RFCs). If you
> > encounter only those patterns, you can assume UTF-8; if you encounter
> > other high-byte values, you can assume ISO-8859-1. If no such markers
> > are found at all, you can probably use the platform default.
> >
> > Code idea:
> >
> > // wrap the input stream
> > SmartEncodingByteReader br = new SmartEncodingByteReader(anInputStream);
> >
> > // read just enough bytes to guess the encoding
> > String encoding = br.readEncoding();
> >
> > // then read characters as normal
> > InputStreamReader reader = new InputStreamReader(br, encoding);
> >
> > In the readEncoding() method, the bytes would be buffered while
> > reading and guessing. When you start reading from the character
> > reader, the smart byte reader first serves the bytes from its
> > internal buffer, then continues reading the rest of the data from the
> > stream. The size limit keeps it from building a huge buffer for very
> > large files.
> >
> > Hope this helps; thanks for your interest,
> > Guillaume
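To make Guillaume's byte-pattern check concrete, here is a minimal sketch of the heuristic in plain Java. The class and method names are made up for illustration, and a real detector would also want to recognize BOMs and cap how many bytes it inspects:

import java.io.*;

public class EncodingGuesser {

    // Guess an encoding from a buffer of raw bytes (a hypothetical
    // helper, not a real API). Every valid UTF-8 multi-byte sequence
    // has a lead byte announcing its length, followed only by
    // 10xxxxxx continuation bytes.
    public static String guessEncoding(byte[] buf, int len) {
        boolean sawHighByte = false;
        int i = 0;
        while (i < len) {
            int b = buf[i] & 0xFF;
            if (b < 0x80) { i++; continue; }           // plain ASCII byte
            sawHighByte = true;
            int trailing;
            if ((b & 0xE0) == 0xC0) trailing = 1;      // 110xxxxx
            else if ((b & 0xF0) == 0xE0) trailing = 2; // 1110xxxx
            else if ((b & 0xF8) == 0xF0) trailing = 3; // 11110xxx
            else return "ISO-8859-1";                  // impossible UTF-8 lead byte
            if (i + trailing >= len) break;            // sequence cut off by buffer end
            for (int j = 1; j <= trailing; j++) {
                if ((buf[i + j] & 0xC0) != 0x80)       // must be 10xxxxxx
                    return "ISO-8859-1";
            }
            i += trailing + 1;
        }
        return sawHighByte ? "UTF-8" : "US-ASCII";
    }

    public static void main(String[] args) throws IOException {
        // usage: read up to 8K from a file and guess
        FileInputStream in = new FileInputStream(args[0]);
        byte[] buf = new byte[8192];
        int len = in.read(buf);
        in.close();
        System.out.println(guessEncoding(buf, Math.max(len, 0)));
    }
}

Random ISO-8859-1 or KOI8-R text almost never satisfies the continuation-byte rule by accident, which is why the check works even without a BOM; the ambiguity Maxim points out only remains between the single-byte encodings themselves.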
