16 bits per code unit is enough for most spoken languages (see Unicode's Blocks.txt and/or Scripts.txt for an approximate list), whereas a single 8-bit code unit only covers ASCII. Despite what http://utf8everywhere.org/#conclusions says, UTF-16 is not the worst choice; it is a trade-off between performance and memory consumption in the most common use case (spoken languages and mixed scripts).
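
To make the trade-off concrete, here is a rough sketch (plain C++, no Qt) that counts how much storage a code point needs in UTF-8 and in UTF-16. The names utf8Units, utf16Units and encodedSize are made up for this mail and are not Qt API; the ranges are the standard UTF-8/UTF-16 boundaries, and the sketch assumes valid code points (no surrogate or range checking).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of 8-bit code units UTF-8 needs for one code point.
constexpr std::size_t utf8Units(std::uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Hypothetical helper: number of 16-bit code units UTF-16 needs for one code point.
constexpr std::size_t utf16Units(std::uint32_t cp)
{
    return cp < 0x10000 ? 1 : 2; // outside the BMP: surrogate pair
}

struct EncodedSize {
    std::size_t utf8Bytes;
    std::size_t utf16Bytes;
};

// Total storage, in bytes, for a sequence of code points in each encoding.
inline EncodedSize encodedSize(const std::vector<std::uint32_t> &codePoints)
{
    EncodedSize size{0, 0};
    for (std::uint32_t cp : codePoints) {
        size.utf8Bytes += utf8Units(cp);
        size.utf16Bytes += 2 * utf16Units(cp); // two bytes per 16-bit unit
    }
    return size;
}
```

For example, U+0041 (Latin A) costs 1 byte in UTF-8 and 2 in UTF-16, U+0416 (Cyrillic) costs 2 in both, U+4E2D (CJK) costs 3 in UTF-8 and 2 in UTF-16, and U+1F600 (outside the BMP) costs 4 in both. So UTF-16 only loses on ASCII-heavy text, and anything inside the BMP stays a single code unit.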

Konstantin

2015-02-10 21:55 GMT+04:00 Rutledge Shawn <shawn.rutle...@theqtcompany.com>:

>
> On Feb 10, 2015, at 17:08, Julien Blanc <julien.bl...@nmc-company.com> wrote:
>
> > On 10/02/2015 16:33, Knoll Lars wrote:
> >> IMO there’s simply too many questions that this one example doesn’t answer
> >> to conclude that what we are doing is bad.
> >
> > Two arguments :
> > - implicit sharing is convenient, and really developer friendly. It is
> > probably a good idea since strings are really present a lot in signals
> > and slots (and afaik, passed by value in these context)
> > - implicit sharing is implicit, you don’t have the choice not to pay for
> > it, which is a bad thing.
> >
> > From my experience, QStrings are slow. About two times slower than
> > using plain std::string in our use cases, but the culprit for this
> > slowness is, as far as we know, the internal 16 bits encoding, whereas
> > our data sources are all using utf-8. We have no evidence that the
> > implicit sharing cost is significant or not.
>
> Should we try to use UTF-8 in some future version of Qt? I’ve wondered
> that for a while: 16 bits is not enough for any possible Unicode character,
> whereas 32 bits would be; and yet 8 bits is enough most of the time. Isn’t
> 16 bits the worst choice then? (some bloat for European languages, and
> algorithmic inefficiency for others) With 32-bit characters, operator[] is
> always O(1). If we use UTF-8, the code would often have to iterate a
> variable number of bytes to get to the next character. But is it worth it
> to save memory? Especially considering the point from earlier that
> operations on data which fit entirely within cache memory will be so much
> faster that it swamps the O(whatever) efficiency of some algorithms:
> keeping strings as small as possible should be a good thing. And maybe
> there are some clever tricks to get faster character indexing, using
> bitfields or binary search or an occasional weak reference to a 32-bit
> decoded version when it’s really needed. Emphasizing use of iterators
> instead of operator[] would help too.
>
> I googled "utf-8 character indexing" and the top hit was this (which I’ve
> probably seen before): http://utf8everywhere.org/
> _______________________________________________
> Development mailing list
> Development@qt-project.org
> http://lists.qt-project.org/mailman/listinfo/development
>