Re: UTF-8 BOM generation option

Justin Dearing Mon, 31 Dec 2007 08:23:53 -0800

Nicholas,

The UTF-8 BOM debate is perhaps one of the largest wastes of time in
computer science out there. It's also one of the reasons I've joined
this list. So without further ado, let me add to the debate.


The Java tools that build the Xerces docs, at least the ones in the
2.8 svn tree, choke on the UTF-8 BOM. That's only because they are old
and no one wants to fix them. The UTF-8 BOM is unnecessary, and
therefore should never be the default, IMHO. Generally, a parser should
be able to handle UTF-8 with or without the BOM, as a matter of
practicality. So I think your suggestion falls within those
guidelines.

Regards,

Justin Dearing

On Dec 31, 2007 11:03 AM,  <[EMAIL PROTECTED]> wrote:
>
>
> Hello all,
>
> I realize that the UTF-8 spec does not require the 0xEFBBBF 3-byte BOM be
> added to a UTF-8 encoded file, but some editors (MS, vim, etc.) use this
> BOM when reading the XML file to determine encoding.  The reality of the
> situation is that a number of UTF-8 files do contain a BOM, and this trend
> seems to be becoming more prevalent over time (at least with the XML
> datasets that I have been exposed to over the years).
>
> Luckily, Xerces already handles BOM markers for UTF-8 files, so there is no
> compatibility issue with it being able to read its own generated files.
>
> My suggestion is to allow Xerces to generate a BOM for a UTF-8 encoded file
> if it is explicitly asked to do so through the serializer (DOMWriter) by
> setting the XMLUni::fgDOMWRTBOM feature.  Most people won't set this feature,
> so by default generated UTF-8 files would still not contain the BOM, but
> with this change adding a BOM to generated UTF-8 files would become an
> option for those who do want it.
>
> Since the Xerces code is well written, the code modifications would be quite
> small to accommodate this change.
>
> I can make the changes and submit them as a patch, but first I would like
> to generate a discussion about this topic to help determine what the best
> implementation should be.  I'd ask that a pragmatic and realistic viewpoint
> rather than a hard-line spec viewpoint be adopted, since BOMs in UTF-8
> encoded files are out there and will not be going away.
>
> Thank you,
> _Nicholas
>

