Here's my $0.02 on this issue: I think we should look for the safest route to take here. As Nicholas stated, files containing this BOM are becoming more prevalent. If that's the case, I personally think that Xerces (both the Java and C versions) should just generate the BOM.
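For concreteness, the BOM in question is the three bytes EF BB BF at the very start of the stream. Below is a rough sketch in plain C++ of what emitting it and tolerating it look like at the byte level. The helper names (hasUtf8Bom, withUtf8Bom, stripUtf8Bom) are made up for illustration; they are not part of the Xerces API and say nothing about how Xerces implements this internally.

```cpp
#include <string>

// UTF-8 encoding of U+FEFF: the 3-byte BOM discussed in this thread.
static const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical helper: true if the buffer starts with the UTF-8 BOM.
bool hasUtf8Bom(const std::string& data) {
    return data.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Hypothetical helper: roughly what a writer with a "write BOM" option
// turned on would put on the wire (BOM first, then the document bytes).
std::string withUtf8Bom(const std::string& xml) {
    return hasUtf8Bom(xml) ? xml : kUtf8Bom + xml;
}

// Hypothetical helper: what a BOM-tolerant receiver could do before
// handing the bytes to a parser that does not understand the BOM.
std::string stripUtf8Bom(const std::string& data) {
    return hasUtf8Bom(data) ? data.substr(kUtf8Bom.size()) : data;
}
```

A receiver that does something like stripUtf8Bom before parsing is unaffected either way; one that does not will see three unexpected bytes in front of the XML declaration.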
However, I also understand that this change could cause a problem. One situation I can see is applications using XML for some kind of inter-process communication, not necessarily XML-RPC or SOAP. Suppose one application uses Xerces to parse the XML data it receives, and another does NOT use Xerces and does NOT support the 3-byte BOM. If the Xerces-dependent application transmits the 3-byte BOM, will the other application handle the data properly or not?

Hope this helps stir up the conversation,
Keith

On Dec 31, 2007 8:28 AM, <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I sent this same email to the c-dev list. Its content applies from both
> a user as well as a dev (mods) perspective, so I'm posting to this list
> as well.
>
> -----------------
>
> I realize that the UTF-8 spec does not require the 0xEFBBBF 3-byte BOM
> be added to a UTF-8 encoded file, but some editors (MS, vim, etc.) use
> this BOM when reading the XML file to determine encoding. The reality
> of the situation is that a number of UTF-8 files do contain a BOM, and
> this trend seems to be becoming more prevalent over time (at least with
> the XML datasets that I have been exposed to over the years).
>
> Luckily, Xerces already handles BOM markers for UTF-8 files, so there is
> no compatibility issue with it being able to read its own generated
> files.
>
> My suggestion is to allow Xerces to generate a BOM for a UTF-8 encoded
> file if it is explicitly asked to do so through the serializer (DOMWriter)
> by setting the XMLUni::fgDOMWRTBOM feature. Most people won't set this
> feature, which keeps the current behavior of generated UTF-8 files not
> containing the BOM, but with this change adding a BOM to generated
> UTF-8 encoded files would become an option for those who do want it.
>
> Since the Xerces code is well written, the code modifications would be
> quite small to accommodate this change.
>
> I can make the changes and submit them as a patch request, but first I
> would like to start a discussion about this topic to help determine what
> the best implementation should be. I'd ask that a pragmatic and
> realistic viewpoint, rather than a hard-line spec viewpoint, be adopted,
> since BOMs in UTF-8 encoded files are out there and will not be going
> away.
>
> Thank you,
> _Nicholas
>
> -- www.savedbycuriosity.com
