On Monday, October 24, 2011 17:58:15 Simen Kjaeraas wrote:
> On Mon, 24 Oct 2011 16:02:24 +0200, Steven Schveighoffer
> <schvei...@yahoo.com> wrote:
> > On Sat, 22 Oct 2011 05:20:41 -0400, Walter Bright
> > <newshou...@digitalmars.com> wrote:
> >> On 10/22/2011 2:21 AM, Peter Alexander wrote:
> >>> Which operations do you believe would be less efficient?
> >>
> >> All of the ones that don't require decoding, such as searching, would
> >> be less efficient if decoding was done.
> >
> > Searching that does not do decoding is fundamentally incorrect. That
> > is, if you want to find a substring in a string, you cannot just compare
> > chars.
>
> Assuming both string are valid UTF-8, you can. Continuation bytes can never
> be confused with the first byte of a code point, and the first byte always
> identifies how many continuation bytes there should be.
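
The property described in the quote above can be checked with a small sketch. It is written in Python rather than D purely for illustration, and the helper names `is_continuation` and `sequence_length` are hypothetical, not part of any library. It assumes both strings are valid UTF-8, as the quote does.

```python
def is_continuation(byte: int) -> bool:
    # Continuation bytes always have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80


def sequence_length(lead: int) -> int:
    # The lead byte alone announces how long its sequence is.
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if (lead & 0xE0) == 0xC0:
        return 2  # 110xxxxx
    if (lead & 0xF0) == 0xE0:
        return 3  # 1110xxxx
    if (lead & 0xF8) == 0xF0:
        return 4  # 11110xxx
    raise ValueError("not a lead byte")


haystack = "naïve café €5".encode("utf-8")
needle = "café".encode("utf-8")

# Plain byte-wise substring search, with no decoding at all.
byte_index = haystack.find(needle)

# A match in valid UTF-8 starts on a lead byte, never on a
# continuation byte, so the offset is a real code point boundary.
starts_on_boundary = not is_continuation(haystack[byte_index])

# Converting the byte offset to a code point index needs decoding
# only of the prefix, and only if the caller wants that index.
code_point_index = len(haystack[:byte_index].decode("utf-8"))
```

Note that the byte offset and the code point index differ here because "ï" occupies two code units; the match itself is still correct.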
Yes, but as far as iterating through a string looking for a specific character goes, you can't simply search for it like you would search for an integer in an int[] unless you decode it. Techniques to search more efficiently exist in a number of cases, as long as you understand Unicode well enough, but as the default method of searching, it's just not going to work. And once you actually care about stuff on the level of graphemes (which admittedly Phobos doesn't do yet), you either have to decode everything, or searching becomes much more complicated.

Really, what it comes down to is that decoding by default will result in correct but less efficient code. Not decoding by default will inevitably result in incorrect code, except in cases where people luck out (e.g. they are only really dealing with ASCII) or where they know enough that they would have specifically chosen to search on char for the first code unit of a code point, and things of that variety, in order to gain efficiency.

There are just going to be fewer bugs if the default is correct but easily allows the programmer to use more efficient methods if they choose to.

- Jonathan M Davis
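
As a rough illustration of the points in this reply, the sketch below compares searching for a single character by code unit with searching the decoded text, and then shows how grapheme-level matching complicates things further. It is in Python rather than D, only for illustration; the variable names are made up for the example.

```python
import unicodedata

text = "café"
units = text.encode("utf-8")  # b'caf\xc3\xa9'

# Searching for a character as if it were one integer in an array of
# code units: U+00E9 is not stored as the single byte 0xE9 in UTF-8.
missed = ord("é") in units            # false negative

# The reverse failure: the byte 0xA9 (the code point of "©") occurs
# as the second code unit of "é", although "©" is not in the text.
false_hit = ord("©") in units         # false positive

# Decoding first gives the correct answers.
decoded_has_e_acute = "é" in text
decoded_has_copyright = "©" in text

# The efficient and still correct variant: encode the needle and
# search for the whole code unit sequence.
encoded_search = "é".encode("utf-8") in units

# Grapheme level: the same visible text can be spelled as "e" plus a
# combining acute accent (U+0301), which a code point search misses
# unless both sides are normalized to the same form.
decomposed = unicodedata.normalize("NFD", text)
found_in_decomposed = "é" in decomposed
found_after_nfc = "é" in unicodedata.normalize("NFC", decomposed)
```

The encoded-needle search is the kind of more efficient method a programmer can opt into once they know why it is safe, while the decoded search is what a correct default looks like.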