Re: UTF-8 BOM generation option

Keith Mendoza Mon, 31 Dec 2007 10:14:58 -0800

Here's my $0.02 on this issue: I think we should take the safest route
here. As Nicholas stated, files containing this BOM are becoming more
prevalent. If that's the case, I personally think that Xerces (both the
Java and C versions) should just generate the BOM.


However, I also understand that this change could cause problems. One
situation I see is applications using XML for some kind of inter-process
communication, not necessarily XML-RPC or SOAP. Suppose one application
uses Xerces to parse the XML data it receives, and another one does NOT use
Xerces and does NOT support the 3-byte BOM. If the Xerces-dependent
application transmits the 3-byte BOM, will the other application handle the
data properly or not?
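
To make the concern concrete, here is a minimal sketch of what a
non-Xerces receiver would need to do to tolerate the BOM. It is plain
standard C++ and does not use any Xerces API; the helper names
(hasUtf8Bom, stripUtf8Bom, withUtf8Bom) are made up for illustration.

```cpp
#include <string>

// The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
static const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical helper: true when `data` begins with the 3-byte UTF-8 BOM.
bool hasUtf8Bom(const std::string& data) {
    return data.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Hypothetical helper for a receiver that does not understand the BOM:
// drop the BOM if present, otherwise return the data unchanged.
std::string stripUtf8Bom(const std::string& data) {
    return hasUtf8Bom(data) ? data.substr(kUtf8Bom.size()) : data;
}

// Hypothetical helper for a sender that wants the BOM:
// prepend it unless the data already starts with one.
std::string withUtf8Bom(const std::string& data) {
    return hasUtf8Bom(data) ? data : kUtf8Bom + data;
}
```

A receiver that calls something like stripUtf8Bom() before handing the
bytes to its own parser would be unaffected by the change; one that does
not would see three unexpected bytes ahead of the XML declaration.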

Hope this helps stir up the conversation,
Keith

On Dec 31, 2007 8:28 AM, <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I sent this same email to the c-dev list.  Its content applies from both
> a user and a dev (mods) perspective, so I'm posting to this list as
> well.
>
> -----------------
>
> I realize that the UTF-8 spec does not require the 0xEFBBBF 3-byte BOM
> be added to a UTF-8 encoded file, but some editors (MS, vim, etc.) use
> this BOM when reading the XML file to determine encoding.  The reality
> of the situation is that a number of UTF-8 files do contain a BOM, and
> this trend seems to be becoming more prevalent with time (at least with
> the XML datasets that I have been exposed to over the years).
>
> Luckily, Xerces already handles BOM markers for UTF-8 files, so there is
> no compatibility issue with it being able to read its own generated
> files.
>
> My suggestion is to allow Xerces to generate a BOM for a UTF-8 encoded
> file if it is explicitly asked to do so through the serializer (DOMWriter)
> by setting the XMLUni::fgDOMWRTBOM feature.  Most people won't set this
> feature, so by default generated UTF-8 files would still not contain the
> BOM, as they do today.  With this change, adding a BOM to generated UTF-8
> encoded files would become an option for those who do want it.
>
> Since the Xerces code is well written, the code modifications would be
> quite small to accommodate this change.
>
> I can make the changes and submit them as a patch, but first I would
> like to start a discussion about this topic to help determine what
> the best implementation should be.  I'd ask that a pragmatic and
> realistic viewpoint be adopted rather than a hard-line spec viewpoint,
> since BOMs in UTF-8 encoded files are a reality and will not be going
> away.
>
> Thank you,
> _Nicholas
>
>


-- 
www.savedbycuriosity.com
