Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Monday, 25 May 2020 04:37:26 PDT Edward Welbourne wrote:
> The "comparisons" heading might stretch as far as using a UTF-8 key to
> do a look-up in a QString-keyed hash,

Using UTF-8 data to look up in a QString-keyed hash will require conversion to UTF-16 to calculate the hash. It can't be calculated on-the-fly.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
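As a rough illustration of the convert-then-hash path described in this message, here is a sketch in plain standard C++. It is not Qt code: std::u16string and std::unordered_map stand in for QString and a QString-keyed hash, toUtf16 and lookupUtf8 are hypothetical names, and the decoder assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical helper: converts UTF-8 to UTF-16.
// Assumption: the input is valid UTF-8; no error handling is attempted.
std::u16string toUtf16(std::string_view utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const auto b0 = static_cast<unsigned char>(utf8[i++]);
        // Number of continuation bytes implied by the lead byte.
        int extra = b0 < 0x80 ? 0 : b0 < 0xE0 ? 1 : b0 < 0xF0 ? 2 : 3;
        char32_t cp = extra == 0 ? b0 : (b0 & (0x3F >> extra));
        for (; extra > 0; --extra) {
            assert(i < utf8.size()); // valid input never ends mid-sequence
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i++]) & 0x3F);
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Stand-in for a QString-keyed hash: keys are stored as UTF-16.
using Table = std::unordered_map<std::u16string, int>;

// The UTF-8 key is converted to UTF-16 first, so that its hash matches
// the hash computed over the stored UTF-16 keys.
const int *lookupUtf8(const Table &table, std::string_view utf8Key)
{
    const auto it = table.find(toUtf16(utf8Key));
    return it == table.end() ? nullptr : &it->second;
}
```

The temporary std::u16string created for every lookup is the cost under discussion: the hash is defined over the UTF-16 representation, so in this sketch the key has to be materialised in that form before it can be hashed.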
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
Thiago Macieira (23 May 2020 03:06) wrote:
> Update:
>
> As we're reviewing the changes Lars is making to get rid of
> QStringRef, Lars, Marc and I came to the conclusion that
> QUtf8StringView is required for Qt 6.0.
[snip]

Sounds sensible. I would just call it QUtf8View, since (see below) I don't see value in a separate QUtf8String for it to be a view into, so making clear that it's a view, not backed by any particular string type, has value; but the detail of naming is less important.

> There are currently no conclusions on QUtf8String and QAnyString, nor
> on what the APIs should look like.

I don't really see the need for an owning 8-bit string type (hence, equally, for QAnyString); we have QByteArray to serve as data-owner behind a UTF-8 view, when the data's not a C-string literal but is known to be UTF-8, and the simplicity of "when we store bytes with the semantics of text, we always do so in UTF-16" argues against doing anything more with UTF-8 views than supporting comparisons (including starts-with, ends-with, contains, index-of) and constructing a QString out of one.

The "comparisons" heading might stretch as far as using a UTF-8 key to do a look-up in a QString-keyed hash, if doing so does actually bring a meaningful saving compared to converting to UTF-16 first; which, of course, might resurface in various other query APIs (asking for an HTTP header's value from an object packaging a map, or an HTML tag's attribute value).

There are perhaps other places where it'll make sense for APIs taking a QStringView to also have a QUtf8View overload; but, crucially, by limiting UTF-8 to view-level support, we provide a bound on how widely it makes sense to burden our APIs with more overloads than just QString and/or up to two views.

	Eddy.
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
> There's only our own laziness which stands in the way of this better
> alternative.
[snip the rest]

Update:

As we're reviewing the changes Lars is making to get rid of QStringRef, Lars, Marc and I came to the conclusion that QUtf8StringView is required for Qt 6.0.

That's because some methods that previously returned QStringRef now return QStringView and to retain compatibility with:

    if (xml.attribute("foo") == "bar")

where QXmlStreamReader::attribute() returns QStringView, we really need to capture that "bar" as a UTF-8 string and we ought to have optimised UTF-16 to UTF-8 comparisons. So we're working on it.

If it had been wrapped in QLatin1String(), there would be no compatibility issues, as there already is an operator==() for QStringView/QLatin1String.

There are currently no conclusions on QUtf8String and QAnyString, nor on what the APIs should look like.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
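For readers wondering what a UTF-16 to UTF-8 comparison without an intermediate conversion might look like, here is a minimal, unoptimised sketch in standard C++. It is not the Qt implementation (which the message above says is still being worked on): std::u16string_view and std::string_view stand in for QStringView and a UTF-8 view, the function names are hypothetical, and both inputs are assumed to be well-formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: decodes one code point from UTF-8 at s[i], advancing i.
// Assumption: the input is valid UTF-8; malformed input is not detected.
static char32_t nextFromUtf8(std::string_view s, std::size_t &i)
{
    assert(i < s.size());
    const auto b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80)
        return b0; // US-ASCII fast path
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical helper: decodes one code point from UTF-16 at s[i], advancing i.
// A lone surrogate is returned unchanged.
static char32_t nextFromUtf16(std::u16string_view s, std::size_t &i)
{
    assert(i < s.size());
    const char16_t u0 = s[i++];
    if (u0 >= 0xD800 && u0 <= 0xDBFF && i < s.size()) {
        const char16_t u1 = s[i];
        if (u1 >= 0xDC00 && u1 <= 0xDFFF) {
            ++i;
            return 0x10000 + ((char32_t(u0) - 0xD800) << 10)
                           + (char32_t(u1) - 0xDC00);
        }
    }
    return u0;
}

// Compares code point by code point; no temporary string is allocated.
bool equalUtf16Utf8(std::u16string_view lhs, std::string_view rhs)
{
    std::size_t i = 0, j = 0;
    while (i < lhs.size() && j < rhs.size()) {
        if (nextFromUtf16(lhs, i) != nextFromUtf8(rhs, j))
            return false;
    }
    // Equal only if both inputs were consumed completely.
    return i == lhs.size() && j == rhs.size();
}
```

With this, something like equalUtf16Utf8(u"bar", "bar") plays the role of the xml.attribute("foo") == "bar" comparison above. A production version would presumably vectorise the US-ASCII case rather than decode one code point at a time.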
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Friday, 15 May 2020 03:33:28 PDT Lars Knoll wrote:
> Pretty much all uses of QL1String that I’ve seen are about ASCII only
> content. That is certainly true for Qt itself, but also to a large degree
> for our users. For those, utf-8 conversions are within 5% of latin1
> decoding. This makes it very clear to me that we should *not* have any
> special handling for ascii that require a separate API.

We don't want Latin1 content in our files.

There are two reasons for having QLatin1String and not QAsciiString:

1) historical. It was added in 4.0 (2005), when a good fraction of people were still running 8-bit Latin1 or Latin9 as their locales. It was actually added as a replacement for people writing macros like this in 3.x times:

    #define L1S(x) QString::fromLatin1(x)

Additionally, we mis-purposed the name "Ascii" in Qt to mean "locale-encoded strings".

2) the Latin1 codec is FAST, but only because it needs to do no error checking. If we had a QAsciiString class or proper US-ASCII conversion functions, we'd get bug reports that something with a high bit set was not flagged and replaced with U+FFFD Replacement Character when converted. This error checking is similar to the UTF-8 decoding, which would make it as fast as the UTF-8 decoder in terms of performance for US-ASCII content.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
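To illustrate point 2, here is a small sketch in standard C++ of the difference between the two decoders. These are scalar stand-ins with hypothetical names, not the vectorised Qt codecs: the Latin1 path widens every byte unconditionally, while a strict US-ASCII decoder (which Qt does not have, per the message above) would need a high-bit check on every byte.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Latin-1 to UTF-16: every byte value 0x00..0xFF is its own code point,
// so decoding is a plain widening with nothing to check.
std::u16string fromLatin1(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char c : in)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    assert(out.size() == in.size()); // one UTF-16 unit per input byte
    return out;
}

// Hypothetical strict US-ASCII decoder: any byte with the high bit set is
// an error and is replaced with U+FFFD REPLACEMENT CHARACTER.
std::u16string fromAsciiChecked(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char c : in) {
        const auto b = static_cast<unsigned char>(c);
        out.push_back(b < 0x80 ? static_cast<char16_t>(b) : char16_t(0xFFFD));
    }
    assert(out.size() == in.size()); // one UTF-16 unit per input byte
    return out;
}
```

The per-byte test in fromAsciiChecked is the same kind of high-bit check a UTF-8 decoder performs on its US-ASCII fast path, which is why the two would be expected to perform alike on pure US-ASCII input.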
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
> On 15 May 2020, at 03:12, Thiago Macieira wrote:
>
> On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
>> Also, given a function like
>>
>>    setFoo(const QByteArray &);
>>
>> what does this actually expect? An UTF-8 string? A local 8-bit string?
>> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
>> master of none.

What I would like to do right now for 6.0 is that all 8-bit encoded text is assumed to be UTF-8. Simple as that. If it’s something else, the developer will have to take care of it himself. This is an important point for Qt 6.0 and independent of any QUtf8String we might or might not add later on.

> Like that, it's just "array of bytes of an arbitrary encoding (or none)".
> There's still a reason to have QByteArray and it'll need to exist in
> networking and file I/O code. That means the string classes, if any, need to
> be convertible to QByteArray anyway.

Agreed.

>> So, assuming the premiss that QByteArray should not be string-ish
>> anymore, what do we want to have as the result type of QString::toUtf8()
>> and QString::toLatin1()? Do we really want mere bytes?
>>
>> I don't think so.
>
> Since for Qt, String = UTF-16, then anything in another encoding is "a bag of
> bytes". QByteArray does serve that purpose.
>
>> If Unicode succeeds, most I/O will be in the form of UTF-8. File names
>> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
>> (as they are on Windows). It makes a _ton_ of sense to have a container
>> for this, and C++20 tempts us with char8_t to do exactly that. I'd love
>> to do string processing in UTF-8 without potentially doubling the
>> storage requirements by first converting it to UTF-16, then doing the
>> processing, then converting it back.

What are we actually gaining by having another string class? Yes, UTF-8 is being used in many places. But are the gains of directly working on UTF-8 enough to justify the duplication of all our string related APIs and implementations?

> Unless you're processing Cyrillic or Greek text, in which case your memory
> usage will be about the same. Or if you're processing CJK, in which case
> UTF-16 is a 33% reduction in memory use.

Correct. UTF-8 only saves space for content that is mostly ASCII. But if you only need ASCII text processing, you can just as well do it on the current QByteArray.

>> Qt should have a strong story not just for UTF-16, but also for UTF-8.
>
> So long as it's not confusing on which class to use, sure. If that means a
> proliferation of overloads everywhere, we've gone wrong somewhere.

+1. Almost all other programming languages out there have standardised on one class for unicode string/text handling. IMO this is the correct approach. The fact that we’re using UTF-16 is historical, but it’s not better or worse than UTF-8. Let’s make transcoding fast, and stop worrying about several encodings.

>> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one,

I’ll veto any UTF-32 string class. There is simply not a single good reason for using such a class. The only ‘advantage’ it has is one unicode code point per index, but that doesn’t help as unicode text processing needs to look beyond that anyway (at e.g. grapheme clusters etc). And it wastes lots of memory.

>> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16
>> operations are not much slower than L1 <-> utf16 ones (I heard Lars'
>> team has them down to within 5% of each other, not sure that's
>> possible).
>
> The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is
> within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The
> difference in performance is the need to check if the high bit is set. Both
> codecs are vectorised with both SSE2 and AVX2 implementations. There are also
> Neon implementations, but I don't know their benchmark numbers (note: the
> UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit).
>
> For non-US-ASCII Latin1 text, the performance is more than 5% worse,
> depending on how dense the non-ASCII characters are in the string. But given
> that we want our files to be encoded in UTF-8 anyway, decoding of non-ASCII
> Latin1 should be rare.
>
> I also have an implementation of UTF-16 to ASCII codec, which is the same as
> UTF-16 to Latin1, but without error checking. That requires that the string
> class store whether it contains only US-ASCII. I've never pushed this to Qt.

Pretty much all uses of QL1String that I’ve seen are about ASCII only content. That is certainly true for Qt itself, but also to a large degree for our users. For those, utf-8 conversions are within 5% of latin1 decoding. This makes it very clear to me that we should *not* have any special handling for ascii that require a separate API.

Conversion speed for non ascii content is something we can improve, there are various BSD licensed implementatio
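The memory figures quoted in this message (about the same for Cyrillic or Greek, a 33% reduction with UTF-16 for CJK, half the size with UTF-8 for ASCII) follow from the per-code-point encoding lengths. A quick sketch in standard C++ that does the arithmetic; utf8Bytes and utf16Bytes are hypothetical helper names, and the input is given as UTF-32 only to keep the counting simple:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Bytes needed to encode the text as UTF-8:
// 1 byte below U+0080, 2 below U+0800, 3 below U+10000, else 4.
std::size_t utf8Bytes(std::u32string_view text)
{
    std::size_t n = 0;
    for (char32_t cp : text) {
        assert(cp <= 0x10FFFF); // assumes valid Unicode code points
        n += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }
    return n;
}

// Bytes needed to encode the text as UTF-16:
// 2 bytes inside the BMP, 4 (a surrogate pair) outside it.
std::size_t utf16Bytes(std::u32string_view text)
{
    std::size_t n = 0;
    for (char32_t cp : text)
        n += cp < 0x10000 ? 2 : 4;
    return n;
}
```

For example, a five-letter ASCII word takes 5 bytes in UTF-8 and 10 in UTF-16; a six-letter Cyrillic word takes 12 bytes in both; three CJK ideographs from the BMP take 9 bytes in UTF-8 and 6 in UTF-16, which is the 33% reduction mentioned above.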
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Thu, May 14, 2020 at 06:12:15PM -0700, Thiago Macieira wrote:
> That means the string classes, if any, need to be convertible to
> QByteArray anyway.

yes, via QTextCodec. (behind the scenes some friend functions may be used for zero-copy conversions.)
Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)
On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
> Also, given a function like
>
>    setFoo(const QByteArray &);
>
> what does this actually expect? An UTF-8 string? A local 8-bit string?
> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
> master of none.

Like that, it's just "array of bytes of an arbitrary encoding (or none)". There's still a reason to have QByteArray and it'll need to exist in networking and file I/O code. That means the string classes, if any, need to be convertible to QByteArray anyway.

> So, assuming the premiss that QByteArray should not be string-ish
> anymore, what do we want to have as the result type of QString::toUtf8()
> and QString::toLatin1()? Do we really want mere bytes?
>
> I don't think so.

Since for Qt, String = UTF-16, then anything in another encoding is "a bag of bytes". QByteArray does serve that purpose.

> If Unicode succeeds, most I/O will be in the form of UTF-8. File names
> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
> (as they are on Windows). It makes a _ton_ of sense to have a container
> for this, and C++20 tempts us with char8_t to do exactly that. I'd love
> to do string processing in UTF-8 without potentially doubling the
> storage requirements by first converting it to UTF-16, then doing the
> processing, then converting it back.

Unless you're processing Cyrillic or Greek text, in which case your memory usage will be about the same. Or if you're processing CJK, in which case UTF-16 is a 33% reduction in memory use.

> Qt should have a strong story not just for UTF-16, but also for UTF-8.

So long as it's not confusing on which class to use, sure. If that means a proliferation of overloads everywhere, we've gone wrong somewhere.

> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one,
> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16
> operations are not much slower than L1 <-> utf16 ones (I heard Lars'
> team has them down to within 5% of each other, not sure that's
> possible).

The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The difference in performance is the need to check if the high bit is set. Both codecs are vectorised with both SSE2 and AVX2 implementations. There are also Neon implementations, but I don't know their benchmark numbers (note: the UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit).

For non-US-ASCII Latin1 text, the performance is more than 5% worse, depending on how dense the non-ASCII characters are in the string. But given that we want our files to be encoded in UTF-8 anyway, decoding of non-ASCII Latin1 should be rare.

I also have an implementation of UTF-16 to ASCII codec, which is the same as UTF-16 to Latin1, but without error checking. That requires that the string class store whether it contains only US-ASCII. I've never pushed this to Qt.

> Anyway, we'd have two class templates, and they'd just be
> instantiated with different Char types to flesh out all of the above,
> with the exception of the byte array ones:
>
>    using QUtf8String = QBasicString;
>    using QString = QBasicString;
>    using QLatin1String = QBasicString;
>    (using QByteArray = QVector;)

BTW, I've said this before: QVector should over-allocate by one element and memset it to zero, if the element is small enough (4 or 8 bytes). This should be done behind the scenes, so the API would never notice it. But it would allow transferring the ownership of a QByteArray's payload to any of the other classes and still have a null-terminated string.

I don't mind having a QUtf8String{,View} but there needs to be a limit on how much we add to its API. Do we have indexOf(char32_t) optimised with vectorisation? Do we have indexOf(QRegularExpression)? The latter would make us link to libpcre2-8 in addition to libpcre2-16 or would require on-the-fly conversions and memory allocations. If your objective is to speed things up, having too many methods may actually make it worse.

And then there's the overload set for generic functions. I'm going to insist on a single, clear rule that does not depend on implementation details and is reasonably future-proof. It has to be about *what* the function does, not *how* it does that.

> If, after getting all of the above running, we _then_ want The One String
> (View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we
> need a QAnyString), which can contain any of the 2-4 string (view)
> classes above (but not QByteArray(View)), but which doesn't have
> string-ish API. Instead, you need to inspect it to extract the actual
> string class (QLatin1String, QUtf8String, QString) contained, or simply
> ask for the one you want, and it will convert, if necessary.

Excluding QLatin1String since I don't think we need that, I'm willing to see this effo