Re: Character Encoding

Ilkka Priha Sat, 25 Oct 2003 06:36:59 -0700

Geir Magnusson Jr. wrote:

On Friday, October 24, 2003, at 12:10 PM, Charles Morehead wrote:
Howdy,
First of all, my apologies if this functionality is already provided by
velocity. I did search through the docs and mailing list and didn't
find what I was looking for... so, that caveat aside:
I've been working with velocity a bit, and one area that seems to be a
little lacking is the ability for templates to specify their own
encoding.
Indeed. It's a chicken-and-egg problem, isn't it? If the template is in UTF-8, how do we read the template to find that out without knowing the character encoding of the byte stream?

[SNIP some well-thought-out discussion]

Let me ask a question. What problem are you trying to solve? Is it that you need to *remember* what encoding a template is in? I've thought about that one on and off, and never came to a good conclusion other than writing a loader wrapper that would take a template name like

foo.vm.utf8

and do the right thing.
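
As a rough illustration of the naming idea, here is a minimal sketch in Python rather than a real Velocity resource loader. The function name `encoding_from_name` and the ISO-8859-1 fallback are assumptions made for the example; the only rule it implements is "if the last dot-separated suffix names a known codec, use it".

```python
import codecs

# Assumed fallback for templates whose name carries no encoding suffix.
DEFAULT_ENCODING = "ISO-8859-1"


def encoding_from_name(name, default=DEFAULT_ENCODING):
    """Return the encoding implied by a name such as 'foo.vm.utf8'.

    The last dot-separated suffix is used only if it names a codec
    that Python knows about; otherwise the default is returned.
    """
    _, dot, suffix = name.rpartition(".")
    if not dot:
        return default
    try:
        # codecs.lookup() raises LookupError for unknown names like 'vm'.
        return codecs.lookup(suffix).name
    except LookupError:
        return default


def load_template(path, default=DEFAULT_ENCODING):
    """Read a template file using the encoding taken from its name."""
    with open(path, encoding=encoding_from_name(path, default)) as handle:
        return handle.read()
```

With this sketch, `encoding_from_name("foo.vm.utf8")` yields the normalized codec name `utf-8`, while a plain `foo.vm` falls back to the default.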

I've also thought about a scheme like you suggested, where we have an informal documentation-style approach, like

## @encoding <encoding>

and then another wrapper loader that figures it out and calls the resource loader with the correct info.
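
A wrapper of that kind could look roughly like the Python sketch below. It is not Velocity code, and the names `declared_encoding` and `decode_template` are made up for the example. It sidesteps the chicken-and-egg problem by assuming the template is in an ASCII-compatible encoding, so the first line can be matched as raw bytes before anything is decoded. It also assumes the comment sits on the very first line with no byte order mark in front of it.

```python
import re

# Matches a first line of the form "## @encoding <encoding>" in raw bytes.
_ENCODING_COMMENT = re.compile(rb"^##\s*@encoding\s+([A-Za-z0-9._-]+)")


def declared_encoding(raw, default="ISO-8859-1"):
    """Return the encoding named in the first-line comment, or the default."""
    first_line = raw.split(b"\n", 1)[0]
    match = _ENCODING_COMMENT.match(first_line)
    if match is None:
        return default
    # Encoding names are plain ASCII, so this decode cannot fail here.
    return match.group(1).decode("ascii")


def decode_template(raw, default="ISO-8859-1"):
    """Decode the whole template using its own declaration."""
    return raw.decode(declared_encoding(raw, default))
```

Because the declaration is an ordinary Velocity comment, it would not show up in the rendered output. The sketch would not work for UTF-16 or other encodings that are not ASCII-compatible.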

geir

Hi,

Why not use the XML (and XHTML and WML as well) declaration instead of a Velocity-specific one, since most markup languages are based on XML:

<?xml version="1.0" encoding="UTF-8" ?>

The encoding of this first line is detected by reading two to four bytes, as specified in http://www.w3.org/TR/REC-xml#sec-guessing:

"Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be '<?xml', any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, '<' is "#x0000003C" and '?' is "#x0000003F", and the Byte Order Mark required of UTF-16 data streams is "#xFEFF". The notation ## is used to denote any byte value except that two consecutive ##s cannot be both 00.

00 00 FE FF     UCS-4, big-endian machine (1234 order)
FF FE 00 00     UCS-4, little-endian machine (4321 order)
00 00 FF FE     UCS-4, unusual octet order (2143)
FE FF 00 00     UCS-4, unusual octet order (3412)
FE FF ## ##     UTF-16, big-endian
FF FE ## ##     UTF-16, little-endian
EF BB BF        UTF-8

This level of autodetection is enough to read the XML encoding declaration and parse the character-encoding identifier, which is still necessary to distinguish the individual members of each family of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859 from each other, or to distinguish the specific EBCDIC code page in use, and so on)."
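
To make the two steps concrete, here is a minimal Python sketch of how a loader might apply them. It is only an illustration under a few assumptions: the function names are invented, only the byte order marks from the table above are recognized, a file without a mark is assumed to be ASCII-compatible up to the end of the declaration, and UTF-8 is used when nothing is declared. The two unusual UCS-4 octet orders are rejected because Python has no codec for them.

```python
import re

# Byte order marks from the quoted table. The four-byte marks come first
# so that FF FE 00 00 (UCS-4) is not mistaken for FF FE (UTF-16).
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# UCS-4 orders 2143 and 3412: detectable, but not decodable here.
_UNSUPPORTED = (b"\x00\x00\xff\xfe", b"\xfe\xff\x00\x00")

_DECLARATION = re.compile(
    r'''<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


def detect_encoding(raw, default="utf-8"):
    """Return (encoding, number_of_mark_bytes_to_skip) for raw bytes."""
    if raw[:4] in _UNSUPPORTED:
        raise ValueError("unusual UCS-4 octet order is not supported")
    for mark, codec in _BOMS:
        if raw.startswith(mark):
            return codec, len(mark)
    # No mark: read the declaration as Latin-1, which maps every byte
    # to a character and therefore never fails.
    head = raw[:256].decode("latin-1")
    match = _DECLARATION.match(head)
    return (match.group(1) if match else default), 0


def decode_template(raw, default="utf-8"):
    """Decode a template, dropping any byte order mark."""
    encoding, skip = detect_encoding(raw, default)
    return raw[skip:].decode(encoding)
```

The sketch lets a byte order mark win over the declaration, since the mark already pins down the exact UTF variant and the declaration is only needed to tell the ASCII-compatible encodings apart.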

-- Ilkka

