That's a good response. I would add a couple of other factors:

- What APIs will you be using? If most of the APIs take or return a
particular UTF, the cost of constant conversions will swamp many if not
most other performance considerations. (A sketch of that boundary cost
follows this list.)

- Asmus mentioned memory, but I'd like to add to that. When you are using
virtual memory, significant increases in memory usage will cause a
considerable slowdown because of swapping. This is especially important
in server environments.
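To make the first point concrete, here is a minimal sketch in C. The API
name is hypothetical, and the converter assumes well-formed UTF-8 (real
code must validate); the point is only that an application storing UTF-8
pays an O(n) conversion on every call into a UTF-16 API:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-in for a UTF-16 API; here it only counts units. */
    static size_t utf16_api_count_units(const uint16_t *s, size_t n)
    {
        (void)s;
        return n;
    }

    /* UTF-8 -> UTF-16. Sketch only: assumes well-formed input, with no
       bounds or validity checking. */
    static size_t utf8_to_utf16(const uint8_t *in, size_t n, uint16_t *out)
    {
        size_t i = 0, o = 0;
        while (i < n) {
            uint32_t cp;
            uint8_t b = in[i];
            if (b < 0x80) {
                cp = b; i += 1;
            } else if ((b & 0xE0) == 0xC0) {
                cp = (uint32_t)(b & 0x1F) << 6 | (in[i+1] & 0x3F);
                i += 2;
            } else if ((b & 0xF0) == 0xE0) {
                cp = (uint32_t)(b & 0x0F) << 12
                   | (uint32_t)(in[i+1] & 0x3F) << 6 | (in[i+2] & 0x3F);
                i += 3;
            } else {
                cp = (uint32_t)(b & 0x07) << 18
                   | (uint32_t)(in[i+1] & 0x3F) << 12
                   | (uint32_t)(in[i+2] & 0x3F) << 6 | (in[i+3] & 0x3F);
                i += 4;
            }
            if (cp < 0x10000) {
                out[o++] = (uint16_t)cp;
            } else {                        /* encode a surrogate pair */
                cp -= 0x10000;
                out[o++] = (uint16_t)(0xD800 | (cp >> 10));
                out[o++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
            }
        }
        return o;
    }

    int main(void)
    {
        const char *text = "caf\xC3\xA9 \xF0\x90\x90\x80"; /* "cafe" + U+10400 */
        uint16_t buf[64];
        /* Every call across the boundary repeats the conversion. */
        for (int call = 0; call < 3; call++) {
            size_t units = utf8_to_utf16((const uint8_t *)text,
                                         strlen(text), buf);
            printf("call %d: %zu UTF-16 units\n",
                   call, utf16_api_count_units(buf, units));
        }
        return 0;
    }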
Mark

----- Original Message -----
From: "Asmus Freytag" <[EMAIL PROTECTED]>
To: "Doug Ewell" <[EMAIL PROTECTED]>; "Unicode Mailing List" <[EMAIL PROTECTED]>
Sent: Friday, December 03, 2004 07:55
Subject: Re: Nicest UTF

> At 09:56 PM 12/2/2004, Doug Ewell wrote:
> >I use ... and UTF-32 for most internal processing that I write
> >myself. Let people say UTF-32 is wasteful if they want; I don't tend to
> >store huge amounts of text in memory at once, so the overhead is much
> >less important than one code unit per character.
>
> For performance-critical applications, on the other hand, you need to use
> whichever UTF gives you the correct balance of speed and average storage
> size for your data.
>
> If you have very large amounts of data, you'll be sensitive to cache
> overruns. Enough so that UTF-32 may be disqualified from the start.
> I have encountered systems for which that was true.
>
> If your 'per character' operations are based on parsing for ASCII symbols,
> e.g. HTML parsing, then both UTF-8 and UTF-16 allow you to process your
> data directly, without needing to worry about the longer sequences
> (sketched below). For such tasks, it may be that some processors will
> work faster when working in 32-bit chunks.
>
> However, many 'inner loop' algorithms, such as copy, can be implemented
> using native machine words, handling multiple characters, or parts of
> characters, at once, independent of the UTF.
>
> And even in those situations, the savings had better not be offset by
> cache limitations.
>
> A simplistic model of the 'cost' of UTF-16 over UTF-32 would consider:
>
> 1) 1 extra test per character (to see whether it's a surrogate)
>
> 2) special handling every 100 to 1000 characters (say 10 instructions)
>
> 3) additional cost of accessing 16-bit registers (per character)
>
> 4) reduction in cache misses (each the equivalent of many instructions)
>
> 5) reduction in disk accesses (each the equivalent of many, many
> instructions)
>
> For many operations, e.g. string length, both 1 and 2 are no-ops
> (sketched below), so you need to apply a reduction factor based on the
> mix of operations you do perform, say 50%-75%.
>
> For many processors, item 3 is not an issue.
>
> For 4 and 5, the multiplier is somewhere in the 100s or 1000s for each
> occurrence, depending on the architecture. Their relative weight depends
> not only on cache sizes, but also on how many other instructions per
> character are performed. For text-scanning operations, their cost does
> predominate with large data sets.
>
> Given this little model and some additional assumptions about your
> own project(s), you should be able to determine the 'nicest' UTF for
> your own performance-critical case.
>
> A./
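To make Asmus's ASCII-parsing point concrete: in UTF-8, every byte of a
multi-byte sequence has its high bit set, so a raw byte scan for an ASCII
delimiter can never match inside a longer sequence (and in UTF-16, the
surrogates sit at 0xD800-0xDFFF, far from the ASCII range). A minimal
sketch in C, with a hypothetical function name:

    #include <stddef.h>
    #include <stdio.h>

    /* Find the next '<' in UTF-8 text without decoding anything.
       Lead and continuation bytes of multi-byte sequences are all
       >= 0x80, so this byte-wise scan cannot match mid-sequence. */
    static const char *find_tag_open(const char *s, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (s[i] == '<')
                return s + i;
        return NULL;
    }

    int main(void)
    {
        const char *html = "caf\xC3\xA9<br>";               /* 9 bytes */
        const char *p = find_tag_open(html, 9);
        printf("'<' at byte offset %d\n", (int)(p - html)); /* prints 5 */
        return 0;
    }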
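And items 1 and 2 of the cost model, as one loop (a sketch assuming
well-formed UTF-16, not a definitive implementation): counting code
points pays one range test per code unit (item 1) plus rare extra work
on a lead surrogate (item 2), while length in code units needs neither
test, which is why those items are no-ops for such operations.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Count code points in a UTF-16 buffer; assumes well-formed input. */
    static size_t utf16_codepoints(const uint16_t *s, size_t n)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            count++;
            /* Item 1: one extra test per code unit. */
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                i++;   /* Item 2: rare extra work -- skip the trail unit. */
        }
        return count;
    }

    int main(void)
    {
        /* "A" + U+10400 (a surrogate pair): 3 code units, 2 code points. */
        const uint16_t s[] = { 0x0041, 0xD801, 0xDC00 };
        printf("%zu units, %zu code points\n",
               sizeof s / sizeof s[0], utf16_codepoints(s, 3));
        /* Length in code units is simply n -- no surrogate test at all. */
        return 0;
    }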