On 2019-08-22 13:42, Lars Knoll wrote:
That's why we are not removing QLatin1String: the Latin1 algorithm is as fast
as memcpy. The only thing better than that is zero copies.

We could also turn this around: Are we over-optimising here? Do we
have the right balance between ease of use and performance? Converting
utf8 is a bit more costly than latin1, but would that ever matter in
real world use cases?

Once we have proper support for u8 (in Qt, and in C++ (char8_t)), we can certainly think about phasing out QLatin1String. Personally, I don't think the difference in decoding performance between L1 and UTF-8 is the key here.

UTF-8 even has the nice property that it's closed under all text transformations in all locales, unlike L1 (toupper('ß') == ẞ ∉ L1, tolower('I') @ tr_TR = ı ∉ L1, ...). QUtfXXX would also greatly reduce the number of overloads of core string functions we need to provide (the same way as QStringView does already, if you consider QT_STRINGVIEW_LEVEL >= 2).
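To spell out the closure argument, here is a minimal sketch in plain C++ (no Qt). The helper name fitsLatin1 is made up for illustration, and the case-mapping results are hard-coded from the two examples above rather than computed by any locale machinery:

```cpp
// Hypothetical helper, not Qt API: Latin-1 covers exactly U+0000..U+00FF.
constexpr bool fitsLatin1(char32_t c)
{
    return c <= 0xFF;
}

// Inputs of the two case mappings mentioned above are representable in L1...
static_assert(fitsLatin1(U'\u00DF'));   // ß (sharp s)
static_assert(fitsLatin1(U'I'));

// ...but the results are not, so an L1 string cannot hold them.
static_assert(!fitsLatin1(U'\u1E9E'));  // ẞ (capital sharp s)
static_assert(!fitsLatin1(U'\u0131'));  // ı (dotless i, tr_TR lowercase of I)
```

Any code point at all can be encoded in UTF-8, so the same check is never needed there.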

For me, the problem is QUtf8XXX::size() - what should that return?! IOW: what's the meaning of an index into a UTF-8 string? That extends to mid(), left(), right(), split(), ... In all current Qt string classes, size() returns the number of characters (ignoring surrogate pairs in QString, which we can probably live with because there are different ways to spell an ä in Unicode, too (ä, a + ¨), such that any serious text processing is far removed from the simplistic 1 code point = 1 glyph POV anyway, so surrogate pairs aren't much of an issue anymore). Whatever we do here, it will be downhill from where we are. Either size() is O(N) or a string (view) is no longer the size of a pointer (or two). That's 2x (50%+0) O(1) memory per string (view), and such stuff adds up over 1000s of strings...
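To make the size() question concrete, a minimal sketch in plain C++ (not Qt API; both function names are invented here, and the code-point counter assumes its input is valid UTF-8):

```cpp
#include <cstddef>
#include <string_view>

// Number of code units (bytes): O(1), it is just the stored length.
constexpr std::size_t sizeInBytes(std::string_view utf8)
{
    return utf8.size();
}

// Number of code points: O(N), every byte has to be inspected.
// Counts all bytes that are not continuation bytes (bit pattern 10xxxxxx).
constexpr std::size_t sizeInCodePoints(std::string_view utf8)
{
    std::size_t n = 0;
    for (char c : utf8) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the precomposed ä ("\xC3\xA4") this gives 2 bytes and 1 code point; for a + combining diaeresis ("a\xCC\x88") it gives 3 bytes and 2 code points. Neither count is "the number of characters" a user would see, which is the point made above.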

So, maybe, at some point in the future, we can axe QLatin1String. But we need to seriously up UTF-8 support in Qt before that. QString is kind of in the way here, as UTF-16 has the bad side effect of endian dependence. If, say, .qm files were stored in UTF-8, tr() could return a QUtf8View. That's not possible with QString, unless apps come with two .qm files, one LE and one BE.
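The endian dependence can be sketched in a few lines of plain C++ (the two helper names are made up for illustration; this says nothing about how .qm files are actually laid out):

```cpp
#include <array>
#include <cstdint>

// Serialise one UTF-16 code unit as little-endian bytes.
constexpr std::array<std::uint8_t, 2> toUtf16LE(char16_t u)
{
    return { static_cast<std::uint8_t>(u & 0xFF), static_cast<std::uint8_t>(u >> 8) };
}

// Serialise the same code unit as big-endian bytes.
constexpr std::array<std::uint8_t, 2> toUtf16BE(char16_t u)
{
    return { static_cast<std::uint8_t>(u >> 8), static_cast<std::uint8_t>(u & 0xFF) };
}
```

For ä (U+00E4) the LE bytes are E4 00 and the BE bytes are 00 E4, so data written on one kind of machine cannot be used as-is on the other. The UTF-8 form of the same character is the byte sequence C3 A4 on every platform, which is why a UTF-8 file could be handed out as a view without conversion.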

One way to get out of this history pit was mentioned here and there on this ML before: we could have a QAnyString(View) (all names subject to bikeshedding), a string (view) that type-erases the encoding (like a std::variant<QUtf8String(View), QLatin1String(View), QString(View)>), which would be the type used in higher-level APIs (QLineEdit::setText(QAnyStringView)). I think std::filesystem::path got that quite right: you can feed it UTF-8 or UTF-16, and it will transparently convert to and from native API's encoding as needed.

But such a type has to be an _addition_ to, not a replacement of, encoding-dependent string types (proof: how do you process a QAnyString(View) if you're given one? Probably, keeping the std::variant simile, with a visitation mechanism, where the visitor is overloaded on the type). Sure, you can use (char8_t*, qsizetype) and (char16_t*, qsizetype) for that, but then we're back to a place we thought we'd never go back to after we got views: C-like string manipulation APIs.
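A rough sketch of what I mean by visitation, in standard C++17 only. All type and function names below are stand-ins invented for this mail, not existing or proposed Qt API:

```cpp
#include <cstddef>
#include <string_view>
#include <variant>

// Hypothetical encoding-specific views (stand-ins for QUtf8View etc.).
struct Utf8View   { std::string_view data; };
struct Latin1View { std::string_view data; };
struct Utf16View  { std::u16string_view data; };

// The type-erased view used in high-level APIs.
using AnyStringView = std::variant<Utf8View, Latin1View, Utf16View>;

// A visitor overloaded on the encoding. Each overload works on a typed
// view, which is why the encoding-specific types have to keep existing.
struct ByteSize {
    std::size_t operator()(Utf8View v) const   { return v.data.size(); }
    std::size_t operator()(Latin1View v) const { return v.data.size(); }
    std::size_t operator()(Utf16View v) const  { return v.data.size() * sizeof(char16_t); }
};

// Dispatches to the overload matching the currently held encoding.
inline std::size_t byteSize(const AnyStringView &s)
{
    return std::visit(ByteSize{}, s);
}
```

With this, byteSize(AnyStringView{Latin1View{"abc"}}) yields 3 and byteSize(AnyStringView{Utf16View{u"abc"}}) yields 6. The caller of a high-level API never sees the encoding; the implementer of the visitor always does.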

Flame away...

Thanks,
Marc
_______________________________________________
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
