Re: Character Encoding

Charles Morehead Mon, 27 Oct 2003 11:55:51 -0800

Hmm... how about this for an encoding determination algorithm?

 1) Examine the first few bytes, and look for a Unicode BOM
    (byte order mark). If one is present, return the appropriate
    Unicode encoding.

 2) Examine bytes attempting to match (allowing for whitespace):
      ##encoding=Foo

    in ASCII/iso8859-1. If matched, return the specified encoding.

 3) If match fails, then return the default (or specified) encoding.
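
A minimal sketch of the three steps above, in Java. The class and method
names (EncodingSniffer, detect) are hypothetical and not part of Velocity;
the exact marker syntax is an assumption based on the "##encoding=Foo"
example, and UTF-32 BOMs are deliberately ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Velocity.
final class EncodingSniffer {
    // Assumed marker syntax: "##encoding=Foo" at the very start of the
    // template, allowing for whitespace around the tokens.
    private static final Pattern MARKER =
        Pattern.compile("^\\s*##\\s*encoding\\s*=\\s*([A-Za-z0-9._:-]+)");

    /**
     * @param head            the first few bytes of the template
     * @param defaultEncoding the encoding to fall back on
     * @return the encoding name to decode the template with
     */
    static String detect(byte[] head, String defaultEncoding) {
        // Step 1: look for a Unicode BOM (UTF-32 is not handled here).
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";

        // Step 2: decode as ISO-8859-1 (every byte maps to one char)
        // and try to match the marker.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = MARKER.matcher(text);
        if (m.find()) return m.group(1);

        // Step 3: no match, so use the default (or specified) encoding.
        return defaultEncoding;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The returned name is not validated here; a caller would probably want to
check it with java.nio.charset.Charset.isSupported() before using it.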

It would even be relatively simple to avoid opening the input stream a
second time: wrap the original input stream in one that supports
mark/reset, with a buffer that is used only for mark/reset and is of
small but sufficient size (so as not to duplicate, and possibly defeat,
the improved performance of the buffering provided by the
BufferedReader).
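
A rough sketch of that wrapping, using java.io.BufferedInputStream for the
mark/reset support. PeekingStreams, markable, peek and the 256-byte limit
are hypothetical names and values chosen for illustration, not anything
that exists in Velocity.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Velocity.
final class PeekingStreams {
    // Assumed to be enough for a BOM plus a short "##encoding=Foo" line.
    static final int PEEK_LIMIT = 256;

    /** Wraps the stream in a small buffer that exists only for mark/reset. */
    static InputStream markable(InputStream in) {
        return new BufferedInputStream(in, PEEK_LIMIT);
    }

    /**
     * Reads up to limit bytes, then rewinds so that the caller can read
     * the same bytes again from the start of the stream.
     */
    static byte[] peek(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit);
        byte[] buf = new byte[limit];
        int total = 0;
        try {
            while (total < limit) {
                int n = in.read(buf, total, limit - total);
                if (n < 0) break; // end of stream before the limit
                total += n;
            }
        } finally {
            in.reset(); // rewind to the marked position
        }
        return Arrays.copyOf(buf, total);
    }
}
```

The peeked bytes would be handed to the encoding check, and the same
(rewound) stream then wrapped in an InputStreamReader with the chosen
encoding, so the resource is opened only once.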

This would require additional code in Template.process(), unless we
wanted to open and examine and close the InputStream for the resource
in ResourceManagerImpl.

[ BTW, I created a sub-class of Template to try this approach, but it
turned out I had to modify several classes because ResourceManagerImpl
has a hard-coded call to ResourceFactory, which is not extensible (since
it has static methods which are called directly from
ResourceManagerImpl). ]

Adding this feature makes it possible for Velocity to support correct
parsing of encodings without advance knowledge of the encoding. Is
this the sort of feature that would be accepted as a patch?

-Charles Morehead

Daniel Dekany writes:
 > Saturday, October 25, 2003, 3:29:52 PM, Ilkka Priha wrote:
 > 
 > [snip]
 > > why not to apply the XML (and XHTML and WML as well) declaration instead
 > > of a velocity specific one as most markups are based on XML:
 > >
 > > <?xml version="1.0" encoding="UTF-8" ?>
 > [snip]
 > 
 > It can't be used for Vel. templates, as it would be bad if Vel. does not
 > output <?xml ...?> as is... (and after all, Vel. templates are not XML).
 > 
 > As for the possibility of automatic charset detection, the problem is
 > with the non US-ASCII "compatible" charsets, such as EBCDIC-based
 > charsets or UTF-16. XML charset detection works only because it knows
 > that a non-UTF-8/UTF-16 file starts with '<', so 4C must mean that the
 > file uses EBCDIC characters, and also FE FF and such must be a BOM
 > (neither FE nor FF nor 00 is '<' in any charset). But in the case of
 > Vel. templates, the first character can be anything.
 > 
 > But still, a practical solution would be to read the file as ISO-8859-1,
 > and if it starts with #encoding=Foo, then use charset Foo to re-decode
 > the file. Of course, with this method, the special comment can't be
 > detected if the file uses UTF-16 or some EBCDIC spawn, but in this case
 > it just uses the default encoding as now, so you have lost nothing
 > compared to the current situation. At least it works for ISO-8859-X,
 > UTF-8, cpXXX, Shift_JIS, etc. These are the charsets almost everybody
 > uses anyway.
 > 
 > p.s. OK, it is possible to create an EBCDIC file that can be badly
 > interpreted in ISO-8859-1 as "#encoding=...", but, well... there is no
 > real chance for it.
 > 
 > -- 
 > Best regards,
 >  Daniel Dekany
 > 
 > 
 > 
 > ---------------------------------------------------------------------
 > To unsubscribe, e-mail: [EMAIL PROTECTED]
 > For additional commands, e-mail: [EMAIL PROTECTED]
 > 
 > 
