Nicholas,

The UTF-8 BOM debate is perhaps one of the largest wastes of time in computer science. It's also one of the reasons I've joined this list. So without further ado, let me add to the debate.

The Java tools that build the Xerces docs, at least the ones in the 2.8 svn tree, choke on the UTF-8 BOM. That's only because they are old and no one wants to fix them.

The UTF-8 BOM is unnecessary, and therefore should never be the default, IMHO. Generally, as a matter of practicality, a parser should be able to handle UTF-8 with or without the BOM. So I think your suggestion falls within those guidelines.

Regards,
Justin Dearing

On Dec 31, 2007 11:03 AM, <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I realize that the UTF-8 spec does not require the 0xEFBBBF 3-byte BOM be
> added to a UTF-8 encoded file, but some editors (MS, vim, etc.) use this
> BOM when reading the XML file to determine encoding. The reality of the
> situation is that a number of UTF-8 files do contain a BOM, and this trend
> seems to be becoming more prevalent with time (at least with the XML
> datasets that I have been exposed to over the years).
>
> Luckily, Xerces already handles BOM markers for UTF-8 files, so there is no
> compatibility issue with it being able to read its own generated files.
>
> My suggestion is to allow Xerces to generate a BOM for a UTF-8 encoded file
> if it is explicitly asked to do so through the serializer (DOMWriter) by
> setting the XMLUni::fgDOMWRTBOM feature. Most people won't set this feature,
> so generated UTF-8 files would continue not to contain the BOM as they do
> today, but with this change adding a BOM to generated UTF-8 files would
> become an option for those who do want it.
>
> Since the Xerces code is well written, the code modifications would be quite
> small to accommodate this change.
>
> I can make the changes and submit as a patch request, but first I would like
> to generate a discussion about this topic to help determine what the best
> implementation should be.
> I'd ask that a pragmatic and realistic viewpoint
> rather than a hard-line spec viewpoint be adopted, since BOMs in UTF-8
> encoded files are out there and will not be going away.
>
> Thank you,
> _Nicholas
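
For what it's worth, the behaviour discussed in this thread (read UTF-8 with or without the BOM, write the BOM only when explicitly asked) can be sketched in plain C++ without touching Xerces at all. This is only an illustration, not Xerces code: the names kUtf8Bom, stripUtf8Bom and encodeUtf8Document are made up for this example, and a real serializer would hook the opt-in flag into its own feature mechanism.

```cpp
#include <string>

// The three-byte UTF-8 encoding of U+FEFF: EF BB BF.
// (Hypothetical name, used only in this sketch.)
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Reader side: accept input with or without a leading BOM.
// Returns the content with the BOM removed if one was present.
std::string stripUtf8Bom(const std::string& bytes) {
    if (bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0) {
        return bytes.substr(kUtf8Bom.size());
    }
    return bytes;
}

// Writer side: no BOM by default; emit one only on explicit request.
std::string encodeUtf8Document(const std::string& body, bool emitBom = false) {
    std::string out;
    if (emitBom) {
        out += kUtf8Bom;
    }
    out += body;
    return out;
}
```

With this shape, stripUtf8Bom(encodeUtf8Document(doc, true)) gives back doc unchanged, so output written with the option switched on stays readable by the same code, and callers who never set the flag see no change in their files.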
