> Philippe,
>
> Also a broken opening tag for HTML/XML documents
>
> In addition to not having endian problems, UTF-8 is also useful when
> tracing intersystem communications data, because XML and other tags are
> usually in the ASCII subset of UTF-8 and stand out, making it easier to
> find the specific data you are looking for.
If you are working on XML documents without parsing them first, at least at the DOM level (I don't mean after validation), then any generic string handling will likely fail, because you may break the well-formedness of the document.
Note however that you are not required to split the document into many string objects: you could as well create a DOM tree whose nodes reference pairs of offsets into the source document, if you did not also have to convert the numeric character references.
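For illustration, here is a minimal sketch of such an offset-based node (the class and field names are hypothetical, not taken from any real parser):

    // Sketch: a DOM-like node that stores no text of its own, only a
    // [start, end) range into the source document. This works until a
    // numeric character reference has to be decoded, as noted above.
    final class OffsetNode {
        final int start;   // offset of the node's first character in the source
        final int end;     // offset just past the node's last character
        final java.util.List<OffsetNode> children = new java.util.ArrayList<>();

        OffsetNode(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // The text is materialized lazily, only when actually requested.
        String text(String sourceDocument) {
            return sourceDocument.substring(start, end);
        }
    }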
If you don't do so, you'll need to create subnodes within text elements, i.e. work at a level below the normal leaf level in DOM. This is anyway what you need to do when references to named entities break up the text level; but even then, for simplicity, you would still need to parse CDATA sections to recreate single nodes that may have been split by CDATA end/start markers inserted into a text stream containing the "]]>" sequence of three characters.
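As a reminder of why that splitting occurs, here is a small sketch (a hypothetical helper, not part of any parser API) of how a writer must split CDATA sections around "]]>", which is exactly what a reader has to undo to recreate a single text node:

    // A CDATA section cannot contain the "]]>" sequence, so a writer must
    // close the section after "]]" and reopen it before ">".
    static String toCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // toCdata("a]]>b") yields <![CDATA[a]]]]><![CDATA[>b]]> which parses
    // back as the two adjacent character runs "a]]" and ">b".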
Clearly, the normative syntax of XML comes before any other interpretation of the data in individual parsed nodes as plain text. So in this case, you'll need to create new string instances to store the parsed XML nodes in the DOM tree. Under this consideration, the encoding of the XML document itself plays a very small role: since you'll need to create a separate "copy" for the parsed text anyway, the encoding you choose for the parsed nodes of the DOM tree can be independent of the encoding actually used in the source XML data, notably because XML allows many distinct encodings across multiple documents that cross-reference each other.
This means that a conversion from the source encoding to the working encoding for DOM tree nodes cannot be avoided, unless you limit your parser to handling only some classes of XML documents (remember that XML uses UTF-8 as the default encoding, so you can't ignore it in any XML parser, even if you later decide to handle the parsed node data as UTF-16 or UTF-32).
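A minimal sketch of that unavoidable conversion step, assuming the declared encoding has already been sniffed from the BOM and/or the encoding pseudo-attribute of the XML declaration (the method name and its parameter are made up for this example):

    import java.io.*;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Sketch: decode the source document into the parser's working
    // encoding (Java's internal UTF-16 here). The "declared" parameter
    // stands for whatever encoding detection has produced; when nothing
    // is declared, UTF-8 is XML's default and must be assumed.
    static Reader openXml(InputStream in, Charset declared) {
        Charset cs = (declared != null) ? declared : StandardCharsets.UTF_8;
        return new BufferedReader(new InputStreamReader(in, cs));
    }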
Then a good question is which preferred central encoding you'll use for the parsed nodes. This depends on the parser API you use: if the API is written for C with byte-oriented null-terminated strings, UTF-8 will be the best representation (though you could also choose GB18030); if the API uses a wide-char C interface, UTF-16 or UTF-32 will most often be the only easy solution. In both cases, your API should return an actual string length rather than depend on null termination (note that XML actually forbids U+0000 even as a numeric character reference like &#0;, so binary-safe length handling also protects you against such malformed input).
Then what your application will do with the parsed nodes (i.e. whether it will build a DOM tree, or use nodes on the fly to create another document) is the application's choice. If a DOM tree is built, an important factor will be the size of XML documents that you can represent and work with in memory as global DOM tree nodes. Whether these nodes, built by the application, are left in UTF-8, UTF-16 or UTF-32, or stored in a more compact representation like SCSU, is an application design decision.
If XML documents are very large, the size of the DOM tree will also become very large, and if your application then needs to perform complex transformations on the DOM tree, the constant need to navigate in the tree will mean frequent random accesses to its nodes. If the whole tree does not fit well in memory, this may put a lot of pressure on the system memory manager, meaning many swaps to disk. Compressing nodes will help reduce the I/O overhead and will improve data locality, so the cost of decompression will become much lower than the performance gained from reduced system resource usage.
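As a rough sketch of that idea (the node layout is hypothetical; java.util.zip does the actual work), a text payload can be kept deflated in memory and inflated only when the node is visited:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    // Sketch: a node whose text payload stays compressed in memory and is
    // only decompressed when the node is actually visited.
    final class CompressedTextNode {
        private final byte[] deflated;
        private final int utf8Length;   // size of the uncompressed UTF-8 form

        CompressedTextNode(String text) {
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            utf8Length = utf8.length;
            Deflater d = new Deflater();
            d.setInput(utf8);
            d.finish();
            // Deflate output can slightly exceed the input in the worst case.
            byte[] buf = new byte[utf8.length + utf8.length / 1000 + 64];
            int n = d.deflate(buf);
            d.end();
            deflated = Arrays.copyOf(buf, n);
        }

        String text() throws DataFormatException {
            Inflater i = new Inflater();
            i.setInput(deflated);
            byte[] out = new byte[utf8Length];
            i.inflate(out);
            i.end();
            return new String(out, StandardCharsets.UTF_8);
        }
    }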
> However, within the program itself, UTF-8 presents a problem when
> looking for specific data in memory buffers. It is nasty, time-consuming
> and error-prone. Mapping UTF-16 to code points is a snap as long as you
> do not have a lot of surrogates. If you do, then UTF-32 should probably
> be considered.
This is not demonstrated by experience. Parsing UTF-8 or UTF-16 is not complex, even for random accesses into the text data, because there is always a small, bounded number of steps needed to find the starting offset of a fully encoded code point: for UTF-16, at most 1 range test and 1 possible backward step; for UTF-8, at most 3 range tests and 3 possible backward steps. UTF-8 and UTF-16 support backward and forward enumerators very easily; so what else do you need to perform any string handling?
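The bounded backward synchronization is short enough to show directly; this sketch assumes well-formed input (the method names are made up):

    // UTF-16 (Java char[]): at most one test and one backward step.
    static int codePointStartUtf16(char[] text, int i) {
        return Character.isLowSurrogate(text[i]) ? i - 1 : i;
    }

    // UTF-8 (byte[]): at most three tests and three backward steps,
    // because every continuation byte matches the bit pattern 10xxxxxx.
    static int codePointStartUtf8(byte[] text, int i) {
        int steps = 0;
        while (steps < 3 && (text[i] & 0xC0) == 0x80) {
            i--;
            steps++;
        }
        return i;
    }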
> From a cost-of-support standpoint, there are valid reasons to use a mix
> of UTF formats.
On that I do agree, but not in the sense previously given on this list:
Different UTF encodings should not be mixed within the same plain-text element. But you can represent the various nodes (which are independent plain-text elements) of a built DOM tree with various encodings, to optimize their internal storage.
You just need a common String interface (in the OO programming sense) to access these nodes, and an implementation (or class) of this interface for each candidate string format. Whatever these classes use as their internal backing store will then be transparent to the application, which will just "see" Strings in a common, unified (and most probably uncompressed) encoding.
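In Java terms, java.lang.CharSequence can play the role of that common interface. Here is a sketch of one candidate class backed by UTF-8 bytes (the class name is made up; a real implementation would decode on the fly or keep a bounded cache instead of retaining the whole decoded string):

    import java.nio.charset.StandardCharsets;

    // Sketch: one implementation of the common interface, backed by UTF-8
    // bytes. The application only ever "sees" ordinary UTF-16 chars.
    final class Utf8String implements CharSequence {
        private final byte[] utf8;
        private String decoded;   // lazy cache of the UTF-16 view

        Utf8String(String s) {
            this.utf8 = s.getBytes(StandardCharsets.UTF_8);
        }

        private String view() {
            if (decoded == null) {
                decoded = new String(utf8, StandardCharsets.UTF_8);
            }
            return decoded;
        }

        @Override public int length() { return view().length(); }
        @Override public char charAt(int index) { return view().charAt(index); }
        @Override public CharSequence subSequence(int from, int to) {
            return view().subSequence(from, to);
        }
        @Override public String toString() { return view(); }
    }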
You may as well reuse common strings using a hashset, or a (non-broken) java.lang.String.intern()-style transformation to atoms. Note however that, for now in Java, the intern() method is broken for this usage because it does not scale well to large numbers of distinct strings: it uses a special fast hashmap with a fixed but too limited number of hash buckets, storing different strings with the same hash in a linked list. The same is true, and even worse, for the Windows CreateAtom() APIs, which don't support collision lists for each hash bucket. Once again, this technique is usable independently of the encoding you use for each string atom stored in the hashset, so atoms can still be stored in compressed format with the one-interface/multiple-classes technique.
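A scalable application-level intern pool is easy to build on java.util.concurrent; this sketch works for any key type with proper equals() and hashCode():

    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: an intern pool that scales better than the fixed-bucket
    // table behind String.intern(), because ConcurrentHashMap resizes as
    // the number of distinct strings grows.
    final class InternPool<T> {
        private final ConcurrentHashMap<T, T> pool = new ConcurrentHashMap<>();

        // Returns the canonical instance ("atom") equal to the argument.
        T intern(T value) {
            T existing = pool.putIfAbsent(value, value);
            return (existing != null) ? existing : value;
        }
    }

    // Usage: InternPool<String> atoms = new InternPool<>();
    //        String tagName = atoms.intern(parsedTagName);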

