Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-25 Thread Thiago Macieira
On Monday, 25 May 2020 04:37:26 PDT Edward Welbourne wrote:
> The "comparisons" heading might stretch as far as using a UTF-8 key to
> do a look-up in a QString-keyed hash,

Using UTF-8 data to look up in a QString-keyed hash will require conversion to 
UTF-16 to calculate the hash. It can't be calculated on-the-fly.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products



___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-25 Thread Edward Welbourne
Thiago Macieira (23 May 2020 03:06) wrote:
> Update:
>
> As we're reviewing the changes Lars is making to get rid of
> QStringRef, Lars, Marc and I came to the conclusion that
> QUtf8StringView is required for Qt 6.0.
[snip]

Sounds sensible.
I would just call it QUtf8View, since (see below) I don't see value in a
separate QUtf8String for it to be a view into, so making clear that it's
a view, not backed by any particular string type, has value; but the
detail of naming is less important.

> There are currently no conclusions on QUtf8String and QAnyString, nor
> on what the APIs should look like.

I don't really see the need for an owning 8-bit string type (hence,
equally, for QAnyString); we have QByteArray to serve as data-owner
behind a UTF-8 view, when the data's not a C-string literal but is known
to be UTF-8, and the simplicity of "when we store bytes with the
semantics of text, we always do so in UTF-16" argues against doing
anything more with UTF-8 views than supporting comparisons (including
starts-with, ends-with, contains, index-of) and constructing a QString
out of one.

The "comparisons" heading might stretch as far as using a UTF-8 key to
do a look-up in a QString-keyed hash, if doing so does actually bring a
meaningful saving compared to converting to UTF-16 first; which, of
course, might resurface in various other query APIs (asking for an HTTP
header's value from an object packaging a map, or an HTML tag's
attribute value).

There are perhaps other places where it'll make sense for APIs taking a
QStringView to also have a QUtf8View overload; but, crucially, by
limiting UTF-8 to view-level support, we provide a bound on how widely
it makes sense to burden our APIs with more overloads than just QString
and/or up to two views.

Eddy.


Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-22 Thread Thiago Macieira
On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
> There's only our own laziness which stands in the way of this better
> alternative.
[snip the rest]

Update:

As we're reviewing the changes Lars is making to get rid of QStringRef, Lars, 
Marc and I came to the conclusion that QUtf8StringView is required for Qt 6.0. 
That's because some methods that previously returned QStringRef now return 
QStringView and to retain compatibility with:

if (xml.attribute("foo") == "bar")

where QXmlStreamReader::attribute() returns QStringView, we really need to 
capture that "bar" as a UTF-8 string and we ought to have optimised UTF-16 to 
UTF-8 comparisons. So we're working on it.
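
As an illustration of what a UTF-16 to UTF-8 comparison without conversion could look like, here is a minimal sketch in standard C++. It is not Qt's implementation: the names kInvalid, nextFromUtf8, nextFromUtf16 and equalsUtf8 are hypothetical, std::u16string_view stands in for QStringView, and the UTF-8 decoder is simplified (it does not reject overlong forms). It decodes both sides one code point at a time and never allocates.

```cpp
#include <cstddef>
#include <string_view>

// Sentinel for malformed input (not a valid Unicode code point).
constexpr char32_t kInvalid = 0xFFFFFFFF;

// Decodes one code point from UTF-8 and advances pos.
// Simplified: rejects truncated sequences and bad continuation bytes,
// but does not reject overlong encodings.
inline char32_t nextFromUtf8(std::string_view s, std::size_t &pos)
{
    const auto b0 = static_cast<unsigned char>(s[pos++]);
    if (b0 < 0x80)
        return b0;                       // US-ASCII: one byte
    int extra = 0;
    char32_t cp = 0;
    if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return kInvalid;                // stray continuation or invalid lead byte
    while (extra-- > 0) {
        if (pos >= s.size())
            return kInvalid;             // truncated sequence
        const auto b = static_cast<unsigned char>(s[pos]);
        if ((b & 0xC0) != 0x80)
            return kInvalid;             // not a continuation byte
        ++pos;
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}

// Decodes one code point from UTF-16 and advances pos.
inline char32_t nextFromUtf16(std::u16string_view s, std::size_t &pos)
{
    const char16_t u0 = s[pos++];
    if (u0 >= 0xD800 && u0 <= 0xDBFF) {  // high surrogate: needs a low one
        if (pos < s.size() && s[pos] >= 0xDC00 && s[pos] <= 0xDFFF) {
            const char16_t u1 = s[pos++];
            return 0x10000 + ((char32_t(u0) - 0xD800) << 10)
                           + (char32_t(u1) - 0xDC00);
        }
        return kInvalid;
    }
    if (u0 >= 0xDC00 && u0 <= 0xDFFF)
        return kInvalid;                 // unpaired low surrogate
    return u0;
}

// Equality of UTF-16 text and UTF-8 text, compared code point by code point.
inline bool equalsUtf8(std::u16string_view lhs, std::string_view rhs)
{
    std::size_t i = 0, j = 0;
    while (i < lhs.size() && j < rhs.size()) {
        const char32_t a = nextFromUtf16(lhs, i);
        const char32_t b = nextFromUtf8(rhs, j);
        if (a == kInvalid || b == kInvalid || a != b)
            return false;
    }
    return i == lhs.size() && j == rhs.size();
}
```

With this, equalsUtf8(u"bar", "bar") plays the role of the xml.attribute("foo") == "bar" comparison above. An optimised version would presumably take a fast path for runs of US-ASCII rather than decoding every code point.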

If it had been wrapped in QLatin1String(), there would be no compatibility 
issues, as there already is an operator==() for QStringView/QLatin1String.

There are currently no conclusions on QUtf8String and QAnyString, nor on what 
the APIs should look like.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-15 Thread Thiago Macieira
On Friday, 15 May 2020 03:33:28 PDT Lars Knoll wrote:
> Pretty much all uses of QL1String that I’ve seen are about ASCII only
> content. That is certainly true for Qt itself, but also to a large degree
> for our users. For those, utf-8 conversions are within 5% of latin1
> decoding. This makes it very clear to me that we should *not* have any
> special handling for ascii that requires a separate API.

We don't want Latin1 content in our files. There are two reasons for having 
QLatin1String and not QAsciiString:

1) historical. It was added in 4.0 (2005), when a good fraction of people were 
still running 8-bit Latin1 or Latin9 as their locales. It was actually added 
as a replacement for people writing macros like this in 3.x times:

#define L1S(x)  QString::fromLatin1(x)

Additionally, we mis-purposed the name "Ascii" in Qt to mean "locale-encoded 
strings".

2) the Latin1 codec is FAST, but only because it needs to do no error 
checking. If we had a QAsciiString class or proper US-ASCII conversion 
functions, we'd get bug reports that something with a high bit set was not 
flagged and replaced with U+FFFD Replacement Character when converted. This 
error checking is similar to the UTF-8 decoding, which would make it as fast 
as the UTF-8 decoder in terms of performance for US-ASCII content.
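
To make point 2 concrete, here is a minimal sketch in standard C++ (not Qt code; fromLatin1 and fromAsciiChecked are hypothetical free functions, and std::u16string stands in for QString). Latin1 decoding is a plain zero-extension of each byte, while a strict US-ASCII decoder has to test the high bit of every byte, which is the same kind of check the UTF-8 decoder performs on US-ASCII input.

```cpp
#include <string>
#include <string_view>

// Latin1 to UTF-16: every byte value 0x00..0xFF maps to the code point of
// the same value, so there is nothing to validate.
inline std::u16string fromLatin1(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char c : in)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    return out;
}

// Hypothetical strict US-ASCII decoder: bytes with the high bit set are not
// US-ASCII and are replaced with U+FFFD REPLACEMENT CHARACTER.
inline std::u16string fromAsciiChecked(std::string_view in)
{
    std::u16string out;
    out.reserve(in.size());
    for (char c : in) {
        const auto b = static_cast<unsigned char>(c);
        out.push_back(b < 0x80 ? static_cast<char16_t>(b) : char16_t(0xFFFD));
    }
    return out;
}
```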

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-15 Thread Lars Knoll
> On 15 May 2020, at 03:12, Thiago Macieira  wrote:
> 
> On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
>> Also, given a function like
>> 
>>    setFoo(const QByteArray &);
>> 
>> what does this actually expect? An UTF-8 string? A local 8-bit string?
>> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
>> master of none.

What I would like to do right now for 6.0 is that all 8-bit encoded text is 
assumed to be UTF-8. Simple as that. If it’s something else, the developer will 
have to take care of it themselves. This is an important point for Qt 6.0 and 
independent of any QUtf8String we might or might not add later on.
> 
> Like that, it's just "array of bytes of an arbitrary encoding (or none)". 
> There's still a reason to have QByteArray and it'll need to exist in 
> networking and file I/O code. That means the string classes, if any, need to 
> be convertible to QByteArray anyway.

Agreed.
> 
>> So, assuming the premiss that QByteArray should not be string-ish
>> anymore, what do we want to have as the result type of QString::toUtf8()
>> and QString::toLatin1()? Do we really want mere bytes?
>> 
>> I don't think so.
> 
> Since for Qt, String = UTF-16, then anything in another encoding is "a bag of 
> bytes". QByteArray does serve that purpose.
> 
>> If Unicode succeeds, most I/O will be in the form of UTF-8. File names
>> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
>> (as they are on Windows). It makes a _ton_ of sense to have a container
>> for this, and C++20 tempts us with char8_t to do exactly that. I'd love
>> to do string processing in UTF-8 without potentially doubling the
>> storage requirements by first converting it to UTF-16, then doing the
>> processing, then converting it back.

What are we actually gaining by having another string class? Yes, UTF-8 is 
being used in many places. But are the gains of directly working on UTF-8 
enough to justify the duplication of all our string related APIs and 
implementations?
> 
> Unless you're processing Cyrillic or Greek text, in which case your memory 
> usage will be about the same. Or if you're processing CJK, in which case 
> UTF-16 is a 33% reduction in memory use.

Correct. Utf-8 only saves space for content that is mostly ascii. But if you 
only need ascii text processing, you can just as well do it on the current 
QByteArray.
> 
>> Qt should have a strong story not just for UTF-16, but also for UTF-8.
> 
> So long as it's not confusing on which class to use, sure. If that means a 
> proliferation of overloads everywhere, we've gone wrong somewhere.

+1. 

Almost all other programming languages out there have standardised on one class 
for unicode string/text handling. IMO this is the correct approach. The fact 
that we’re using UTF-16 is historical, but it’s not better or worse than UTF-8. 
Let’s make transcoding fast, and stop worrying about several encodings.
> 
>> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one,

I’ll veto any UTF-32 string class. There is simply not a single good reason for 
using such a class. The only ‘advantage’ it has is one unicode code point per 
index, but that doesn’t help as unicode text processing anyways needs to look 
beyond that (at e.g. grapheme clusters etc). And it wastes lots of memory.
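
A small sketch of why one code point per index does not buy much, in standard C++ with hypothetical helper names. The text "é" can be written as 'e' followed by U+0301 COMBINING ACUTE ACCENT: that is two UTF-32 elements but one user-perceived character. The helper below only skips the U+0300..U+036F combining diacritics block, which is a gross simplification of real grapheme cluster segmentation and is there only to show that indexing by code point is not enough.

```cpp
#include <cstddef>
#include <string_view>

// Simplification: treats only the Combining Diacritical Marks block as
// "attaches to the previous character". Real segmentation needs far more.
inline bool isCombiningDiacritic(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Counts code points that start a (very roughly approximated) cluster.
inline std::size_t approxPerceivedLength(std::u32string_view text)
{
    std::size_t n = 0;
    for (char32_t cp : text)
        if (!isCombiningDiacritic(cp))
            ++n;
    return n;
}

// 'e' + U+0301: two code points in UTF-32, one character on screen.
inline std::u32string_view decomposedEAcute()
{
    return U"e\u0301";
}
```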

>> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16
>> operations are not much slower than L1 <-> utf16 ones (I heard Lars'
>> team has them down to within 5% of each other, not sure that's
>> possible). 
> 
> The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is 
> within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The 
> difference in performance is the need to check if the high bit is set. Both 
> codecs are vectorised with both SSE2 and AVX2 implementations. There are also 
> Neon implementations, but I don't know their benchmark numbers (note: the 
> UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit).
> 
> For non-US-ASCII Latin1 text, the performance is more than 5% worse, 
> depending on how dense the non-ASCII characters are in the string. But given 
> that we want our files to be encoded in UTF-8 anyway, decoding of non-ASCII 
> Latin1 should be rare.
> 
> I also have an implementation of UTF-16 to ASCII codec, which is the same as 
> UTF-16 to Latin1, but without error checking. That requires that the string 
> class store whether it contains only US-ASCII. I've never pushed this to Qt.

Pretty much all uses of QL1String that I’ve seen are about ASCII only content. 
That is certainly true for Qt itself, but also to a large degree for our users. 
For those, utf-8 conversions are within 5% of latin1 decoding. This makes it 
very clear to me that we should *not* have any special handling for ascii that 
requires a separate API.

Conversion speed for non ascii content is something we can improve, there are 
various BSD licensed implementatio

Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-15 Thread Oswald Buddenhagen

On Thu, May 14, 2020 at 06:12:15PM -0700, Thiago Macieira wrote:
> That means the string classes, if any, need to be convertible to 
> QByteArray anyway.



yes, via QTextCodec.
(behind the scenes some friend functions may be used for zero-copy 
conversions.)



Re: [Development] QUtf8String{, View} (was: Re: QString and related changes for Qt 6)

2020-05-14 Thread Thiago Macieira
On Thursday, 14 May 2020 07:41:45 PDT Marc Mutz via Development wrote:
> Also, given a function like
> 
> setFoo(const QByteArray &);
> 
> what does this actually expect? An UTF-8 string? A local 8-bit string?
> An octet stream? A Latin-1 string? QByteArray is the jack of all these,
> master of none.

Like that, it's just "array of bytes of an arbitrary encoding (or none)". 
There's still a reason to have QByteArray and it'll need to exist in 
networking and file I/O code. That means the string classes, if any, need to 
be convertible to QByteArray anyway.

> So, assuming the premiss that QByteArray should not be string-ish
> anymore, what do we want to have as the result type of QString::toUtf8()
> and QString::toLatin1()? Do we really want mere bytes?
> 
> I don't think so.

Since for Qt, String = UTF-16, then anything in another encoding is "a bag of 
bytes". QByteArray does serve that purpose.

> If Unicode succeeds, most I/O will be in the form of UTF-8. File names
> on Unix are UTF-8 (for all intents and purposes these days), not UTF-16
> (as they are on Windows). It makes a _ton_ of sense to have a container
> for this, and C++20 tempts us with char8_t to do exactly that. I'd love
> to do string processing in UTF-8 without potentially doubling the
> storage requirements by first converting it to UTF-16, then doing the
> processing, then converting it back.

Unless you're processing Cyrillic or Greek text, in which case your memory 
usage will be about the same. Or if you're processing CJK, in which case 
UTF-16 is a 33% reduction in memory use.
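
The arithmetic behind those figures can be checked with a short sketch in standard C++ (the helper names are hypothetical). Code points below U+0080 take 1 byte in UTF-8 and 2 in UTF-16; Cyrillic and Greek fall in the 2-byte UTF-8 range and take 2 bytes in UTF-16; most CJK ideographs take 3 bytes in UTF-8 and 2 in UTF-16, and 2/3 is the 33% reduction.

```cpp
#include <cstddef>
#include <string_view>

// Bytes needed to encode one code point in UTF-8.
constexpr std::size_t utf8Bytes(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed to encode one code point in UTF-16
// (one 16-bit unit, or a surrogate pair above the BMP).
constexpr std::size_t utf16Bytes(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

struct EncodedSizes {
    std::size_t utf8;
    std::size_t utf16;
};

// Sums the encoded size of a text given as UTF-32 code points.
inline EncodedSizes encodedSizes(std::u32string_view text)
{
    EncodedSizes s{0, 0};
    for (char32_t cp : text) {
        s.utf8 += utf8Bytes(cp);
        s.utf16 += utf16Bytes(cp);
    }
    return s;
}
```

For example, the two-letter US-ASCII text U"Qt" is 2 bytes in UTF-8 and 4 in UTF-16, a six-letter Cyrillic word is 12 bytes in both, and a three-ideograph CJK word is 9 bytes in UTF-8 against 6 in UTF-16.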

> Qt should have a strong story not just for UTF-16, but also for UTF-8.

So long as it's not confusing on which class to use, sure. If that means a 
proliferation of overloads everywhere, we've gone wrong somewhere.

> I'm not sure we need the utf32 one, and I'm ok with dropping the L1 one,
> provided a) we can depend on char8_t (ie. Qt 7) and b) utf-8 <-> utf16
> operations are not much slower than L1 <-> utf16 ones (I heard Lars'
> team has them down to within 5% of each other, not sure that's
> possible). 

The conversion of US-ASCII content using either fromUtf8 or fromLatin1 is 
within 5% of the other. The UTF-8 codec is optimised towards US-ASCII. The 
difference in performance is the need to check if the high bit is set. Both 
codecs are vectorised with both SSE2 and AVX2 implementations. There are also 
Neon implementations, but I don't know their benchmark numbers (note: the 
UTF-8 Neon code is AArch64 only, while the Latin1 also runs on 32-bit).
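
For readers unfamiliar with the high-bit check, here is a portable sketch of the idea in standard C++. It is not the SSE2, AVX2 or Neon code mentioned above, and isAscii is a hypothetical name: it only shows that testing for non-US-ASCII bytes amounts to testing bit 7 of each byte, which can be done on several bytes at once (here 8 at a time through a 64-bit mask).

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Returns true if no byte in s has its high bit set, i.e. s is pure US-ASCII.
inline bool isAscii(std::string_view s)
{
    std::size_t i = 0;
    // Process 8 bytes per iteration: any set bit under the mask means
    // at least one byte is >= 0x80.
    for (; i + 8 <= s.size(); i += 8) {
        std::uint64_t chunk;
        std::memcpy(&chunk, s.data() + i, 8);   // safe unaligned load
        if (chunk & UINT64_C(0x8080808080808080))
            return false;
    }
    // Tail: fewer than 8 bytes left, check them one by one.
    for (; i < s.size(); ++i) {
        if (static_cast<unsigned char>(s[i]) & 0x80)
            return false;
    }
    return true;
}
```

A decoder built on this can copy a chunk with zero-extension when the check passes and fall back to full UTF-8 decoding only when it fails, which is one way the US-ASCII case can stay close to Latin1 speed.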

For non-US-ASCII Latin1 text, the performance is more than 5% worse, depending 
on how dense the non-ASCII characters are in the string. But given that we 
want our files to be encoded in UTF-8 anyway, decoding of non-ASCII Latin1 
should be rare.

I also have an implementation of UTF-16 to ASCII codec, which is the same as 
UTF-16 to Latin1, but without error checking. That requires that the string 
class store whether it contains only US-ASCII. I've never pushed this to Qt.

> Anyway, we'd have two class templates, and they'd just be
> instantiated with different Char types to flesh out all of the above,
> with the exception of the byte array ones:
> 
>using QUtf8String = QBasicString;
>using QString = QBasicString;
>using QLatin1String = QBasicString;
>(using QByteArray = QVector;)

BTW, I've said this before: QVector should over-allocate by one element and 
memset it to zero, if the element is small enough (4 or 8 bytes). This should 
be done behind the scenes, so the API would never notice it. But it would 
allow transferring the ownership of a QByteArray's payload to any of the other 
classes and still have a null-terminated string.
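
A rough sketch of that idea in standard C++, under the assumption that "behind the scenes" means the container allocates size + 1 elements and keeps the extra one zeroed. TerminatedBytes is a hypothetical class for illustration, not QVector or QByteArray: size() never reports the extra element, yet the payload can be handed over and used as a null-terminated string.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Owns n bytes but allocates n + 1 and keeps the last one zero.
class TerminatedBytes
{
public:
    TerminatedBytes(const char *data, std::size_t n)
        : m_size(n), m_data(new char[n + 1])
    {
        std::memcpy(m_data.get(), data, n);
        m_data[n] = '\0';               // hidden terminator, not part of size()
    }

    std::size_t size() const { return m_size; }
    const char *data() const { return m_data.get(); }

    // Transfers ownership of the payload; it is still null-terminated.
    std::unique_ptr<char[]> release()
    {
        m_size = 0;
        return std::move(m_data);
    }

private:
    std::size_t m_size;
    std::unique_ptr<char[]> m_data;
};
```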

I don't mind having a QUtf8String{,View} but there needs to be a limit on 
how much we add to its API. Do we have indexOf(char32_t) optimised with 
vectorisation? Do we have indexOf(QRegularExpression)? The latter would make 
us link to libpcre2-8 in addition to libpcre2-16 or would require on-the-fly 
conversions and memory allocations. If your objective is to speed things up, 
having too many methods may actually make it worse.

And then there's the overload set for generic functions. I'm going to insist on 
a single, clear rule that does not depend on implementation details and is 
reasonably future-proof. It has to be about *what* the function does, not 
*how* it does that.

> If, after getting all of the above running, we _then_ want The One String
> (View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we
> need a QAnyString), which can contain any of the 2-4 string (view)
> classes above (but not QByteArray(View)), but which doesn't have
> string-ish API. Instead, you need to inspect it to extract the actual
> string class (QLatin1String, QUtf8String, QString) contained, or simply
> ask for the one you want, and it will convert, if necessary.

Excluding QLatin1String since I don't think we need that, I'm willing to see 
this effo