Philippe,

>> However, within the program itself UTF-8 presents a
>> problem when looking for specific data in memory buffers.
>> It is nasty, time consuming and error prone. Mapping
>> UTF-16 to code points is a snap as long as you
>> do not have a lot of surrogates. If you do then probably
>> UTF-32 should be considered.
>
> This is not demonstrated by experience. Parsing UTF-8 or
> UTF-16 is not complex, even in the case of random accesses
> to the text data, because you always have a bounded and
> small limit to the number of steps needed to find
> the beginning offset of a fully encoded code point: for
> UTF-16, this means at most 1 range test and 1 possible
> backward step. For UTF-8, this limit for random accesses
> is at most 3 range tests and 3 possible backward steps.
> UTF-8 and UTF-16 are very easily supporting backwards and
> forwards enumerators; so what else do you need to perform
> any string handling?
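In code, I agree that finding the start of a code point is short. Here is a minimal, untested sketch of the resynchronization you describe, assuming a well-formed buffer (the function names are my own):

#include <stddef.h>
#include <stdint.h>

/* Step back to the first code unit of the code point covering pos.
 * Every UTF-8 continuation byte matches 10xxxxxx, so a well-formed
 * buffer needs at most 3 range tests and 3 backward steps. */
size_t utf8_sync_back(const uint8_t *buf, size_t pos)
{
    while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}

/* For UTF-16 the worst case is 1 range test and 1 backward step:
 * a low (trailing) surrogate always lies in 0xDC00..0xDFFF. */
size_t utf16_sync_back(const uint16_t *buf, size_t pos)
{
    if (pos > 0 && buf[pos] >= 0xDC00 && buf[pos] <= 0xDFFF)
        pos--;
    return pos;
}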
Sorry, but I was unclear. I was thinking of raw data displayed in hex, for example in a sniffer, a debugger, or a memory dump. There, what is a very simple algorithm in code is not easy to carry out manually: converting UTF-8 to code points means disassembling the hex into bits and recombining the bits to recover each code point. With UTF-16, at worst you may have to flip the byte order of the hex pairs (on a little-endian dump), except for surrogates, which should be few.

On the other hand, because some dumps provide not only hex but also an ASCII rendering of the data, UTF-8 is great for finding tags, as in XML. It allows you to analyze the tree because the tags show up in the ASCII side of the trace display, making it easy to find your specific data elements as well as missing tags, tree-structure errors, or data that is not well formed. It is rare for systems to use non-ASCII tags, and since the tags are only used internally there is no reason they cannot be limited to ASCII just for this improved supportability.

Carl
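P.S. For anyone curious, here is roughly what that by-hand bit disassembly amounts to in code. This is only an illustrative sketch with names of my own, assuming well-formed input:

#include <stdint.h>
#include <stdio.h>

/* The bit gymnastics done by hand on a hex dump: strip the length
 * bits from the UTF-8 lead byte, then shift in the low six bits of
 * each continuation byte.  No validation; illustration only. */
static uint32_t utf8_decode(const uint8_t *p, int *len)
{
    uint32_t cp;
    if (p[0] < 0x80)      { *len = 1; return p[0]; }       /* 0xxxxxxx */
    if (p[0] < 0xE0)      { *len = 2; cp = p[0] & 0x1F; }  /* 110xxxxx */
    else if (p[0] < 0xF0) { *len = 3; cp = p[0] & 0x0F; }  /* 1110xxxx */
    else                  { *len = 4; cp = p[0] & 0x07; }  /* 11110xxx */
    for (int i = 1; i < *len; i++)
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);          /* 10xxxxxx */
    return cp;
}

/* The UTF-16 cases: a byte swap when the dump is little-endian, and
 * the surrogate-pair arithmetic for the (hopefully few) surrogates. */
static uint16_t swap16(uint16_t u) { return (uint16_t)((u >> 8) | (u << 8)); }
static uint32_t from_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

int main(void)
{
    const uint8_t e_acute[] = { 0xC3, 0xA9 };  /* U+00E9 as seen in a dump */
    int len;
    printf("U+%04X (%d bytes)\n", (unsigned)utf8_decode(e_acute, &len), len);
    printf("U+%04X\n", (unsigned)swap16(0xE900));   /* bytes E9 00 in a dump */
    printf("U+%05X\n", (unsigned)from_pair(0xD83D, 0xDE00)); /* surrogate pair */
    return 0;
}

Note that the UTF-8 side takes a mask and a shift for every byte, while the UTF-16 side is normally just the byte swap; that difference is exactly why code points are easier to read straight off a UTF-16 dump.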