RE: Writing UTF-8 files with CF

Steven Erat Thu, 13 May 2004 10:56:55 -0700

Passing along some quoted information from another source that might be
helpful...

I think some of the confusion about encoding arises because it is not
always clear that two separate conversions happen when CFMX processes a
template:

* conversion #1 - The template file is read and converted into 16-bit
Unicode (UTF-16LE on "little-endian" machines like Intel, UTF-16BE on
"big-endian" machines like Sun).

    ... the template is processed by ColdFusion MX ...

* conversion #2 - The response to the browser is converted from 16-bit
Unicode into the desired response encoding as it is sent.

These two conversions are independent.
For example, if a template is encoded as Cp1254 (Windows Turkish), this
does not mean that the response to the browser will be encoded with Cp1254,
or that the header sent to the browser will declare the Turkish encoding,
as in "Content-Type: text/html; charset=Cp1254".
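
To make the independence of the two conversions concrete, here is a minimal
sketch in Python (not CFML, and not how CFMX is implemented). The function
names read_template and write_response are hypothetical, and the sample
string simply stands in for a template file on disk.

```python
def read_template(raw_bytes, page_encoding):
    """Conversion #1: bytes on disk -> Unicode text."""
    return raw_bytes.decode(page_encoding)


def write_response(text, charset):
    """Conversion #2: Unicode text -> bytes sent to the browser."""
    return text.encode(charset)


# Pretend this is a template saved as Cp1254 (Windows Turkish).
template_bytes = "Türkçe İçerik".encode("cp1254")

# Conversion #1 uses the template's encoding...
text = read_template(template_bytes, "cp1254")

# ...and conversion #2 uses whatever the response encoding is (UTF-8 here).
response_bytes = write_response(text, "utf-8")
```

The bytes on disk and the bytes in the response differ, but both decode to
the same text, because each conversion only needs to know its own encoding.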

--- --- --- --- --- --- --- --- --- --- --- --- --- ---
- Regarding conversion #1

Unless told otherwise, CFMX assumes that a template is encoded in the local
default encoding, as determined by the operating system settings.

The encoding of a template may be explicitly specified to CFMX in one of two
ways:

1. With a BOM at the start of the file.  A BOM is a few non-printing bytes
before the first character of the file, which indicate how the file is
encoded.

2. Using the <cfprocessingdirective pageencoding="encoding"> tag at the
start of the template.

Neither of these actually changes anything in the template - they just
inform CFMX about how the template was encoded so CFMX can translate the
template into Unicode.  If the template is specified to be a certain
encoding, the user must be sure that the template was indeed encoded that
way - or CFMX will be fooled into making incorrect conversions as it reads
the template.
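
As a rough illustration of what "fooled" means, the Python sketch below
decodes the same bytes twice: once with the encoding they were really saved
in, and once with a wrongly declared one. The sample text and the choice of
Cp1252 as the wrong declaration are assumptions made for the example.

```python
# Three Turkish letters saved as Cp1254 (Windows Turkish).
raw = "ğüş".encode("cp1254")

# Declared correctly: the original characters come back.
right = raw.decode("cp1254")

# Declared incorrectly as Cp1252 (Windows Western): no error is raised,
# but two of the three characters silently turn into different letters.
wrong = raw.decode("cp1252")

print(right)  # ğüş
print(wrong)  # ðüþ
```

Nothing fails at read time, which is why a wrong declaration is easy to
miss until the output is inspected.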

--- --- --- --- --- --- --- --- --- --- --- --- --- ---
- Regarding conversion #2

By default, CFMX converts the 16-bit Unicode that it uses internally into
UTF-8 when sending responses, regardless of the encoding the template
originally had.  This should be appropriate for nearly all responses, since
UTF-8 can express every Unicode character and is supported by most (all?)
browsers.

If you need to send the response in a different encoding, this can be done
with the <cfcontent> tag.  For example, if you wanted the response to go
out as UTF-16LE, you could use:

    <cfcontent type="text/html; charset=UTF-16LE">

Of course, if you choose an unconventional encoding (like UTF-16LE) you
should make sure your browser can handle it correctly.  One note about
using the <cfcontent> tag this way - there can be no spaces in the
charset=xxxxx part of the type attribute.  If you put a space before or
after the equals sign, it will not work.

--- --- --- --- --- --- --- --- --- --- --- --- --- ---
The list of encodings supported by Java 1.3.1 (and hence, by CFMX
standalone) is at
http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html
We supply the i18n.jar file with CFMX, so both the short and the long lists
of encodings there are available to CFMX.

The three BOMs (byte sequences) which are recognized by CFMX are:

Name             Hex Values
-----------      ----------
UTF8_BOM         EF BB BF
UTF16LE_BOM      FF FE
UTF16BE_BOM      FE FF
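
For readers who want to check a file by hand, here is a small Python sketch
of BOM sniffing based on the three byte sequences above. It is only an
illustration of the idea, not CFMX's actual logic; the names sniff_bom and
decode_template and the Cp1252 fallback are hypothetical.

```python
import codecs

# The three BOMs from the table above, paired with Python codec names.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(raw):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_template(raw, default="cp1252"):
    """Decode using the BOM if there is one, else an assumed default."""
    name, skip = sniff_bom(raw)
    return raw[skip:].decode(name or default)
```

For example, sniff_bom(b"\xef\xbb\xbfhi") returns ("utf-8", 3), and a file
with no BOM falls through to whatever default the caller supplies.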

About template files with 16bit character sets:

If a template file uses a 16-bit version of Unicode, it should have a BOM.
If it doesn't, and the encoding is specified as <cfprocessingdirective
pageencoding="UTF-16">, CFMX is not told the byte order ("endianness") of
the machine the file was created on.
If instead you use <cfprocessingdirective pageencoding="UTF-16LE"> or
<cfprocessingdirective pageencoding="UTF-16BE">, CFMX can correctly
identify whether the "little" byte or the "big" byte comes first in each
two-byte character.
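
The effect of guessing the byte order wrong can be seen in a few lines of
Python. This is only a sketch of the general problem, using a single letter
as an assumed sample; it says nothing about which order CFMX or Java would
pick for a BOM-less "UTF-16" file.

```python
# The letter "A" (U+0041) saved as UTF-16LE, without a BOM.
raw = "A".encode("utf-16-le")   # two bytes: 41 00

# Read with the right byte order, the letter comes back.
as_le = raw.decode("utf-16-le")

# Read with the wrong byte order, the same two bytes are taken as U+4100,
# a completely unrelated character.
as_be = raw.decode("utf-16-be")

print(as_le)             # A
print(hex(ord(as_be)))   # 0x4100
```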

> -----Original Message-----
> From: Tim Blair [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 13, 2004 12:27 PM
> To: CF-Talk
> Subject: Writing UTF-8 files with CF
>
> Evening all,
>
> A week or so ago I posted a message where I was having severe issues
> with writing UTF-8 encoded files from CF.  I did a bit of research (and
> a fair amount of swearing) and discovered the reason behind the problem
> I was having.
>
> In short, when writing UTF-8 encoded files, Java (and therefore CF)
> doesn't prepend the correct Byte Order Mark (BOM) to the file, so when
> it's read back in it's treated as single- not double-byte.
>
> For a bit more information and a funky work around to write correctly
> UTF-8 encoded files that can also be read back in OK, have a look here:
> http://tech.badpen.com/index.cfm?mode=entry&entry=21
>
> Tim.
