Hi Lars,

On 2020-05-12 09:49, Lars Knoll wrote:
[...]
> One open question is whether we should add a QUtf8String with a
> char8_t. I am not yet convinced that we actually need the class
> though.
[...]

I positively want to stop using QByteArray as the QUtf8String that it currently is. QByteArray should lose all notion of string-ness (deprecate toLower() etc, remove in Qt 7) and be a QVector<std::byte>. Not sure we'll get there for Qt 6, not sure we'll get there with the name QByteArray, but that should be the end game for this class.

The networking code is full of uses of QByteArray and, due to the lack of a QByteArrayRef (cf. QStringRef) or QByteArrayView (cf. QStringView), its splitting and substringing is much less performant than it could be.
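
To illustrate what a view type buys here, this is a minimal sketch that uses std::string_view as a stand-in for the not-yet-existing QByteArrayView. The function name splitHeader and the "Name: value" input format are assumptions made for the example, not existing Qt code. Both halves of the result alias the input buffer, so no allocation or copying takes place:

```cpp
#include <string_view>
#include <utility>

// Hypothetical helper: split a "Name: value" line at the first ':'.
// Both returned views point into the caller's buffer; nothing is copied.
std::pair<std::string_view, std::string_view> splitHeader(std::string_view line)
{
    const auto colon = line.find(':');
    if (colon == std::string_view::npos)
        return {line, std::string_view{}}; // no separator: everything is the name

    std::string_view name = line.substr(0, colon);
    std::string_view value = line.substr(colon + 1);

    // Drop the optional spaces after the colon, again without copying.
    while (!value.empty() && value.front() == ' ')
        value.remove_prefix(1);

    return {name, value};
}
```

The same split done with an owning container has to allocate and copy for each of the two parts, which is the overhead referred to above.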

Also, given a function like

   setFoo(const QByteArray &);

what does this actually expect? A UTF-8 string? A local 8-bit string? An octet stream? A Latin-1 string? QByteArray is the jack of all these, master of none.

So, assuming the premise that QByteArray should not be string-ish anymore, what do we want to have as the result type of QString::toUtf8() and QString::toLatin1()? Do we really want mere bytes?

I don't think so.

If Unicode succeeds, most I/O will be in the form of UTF-8. File names on Unix are UTF-8 (for all intents and purposes these days), not UTF-16 (as they are on Windows). It makes a _ton_ of sense to have a container for this, and C++20 tempts us with char8_t to do exactly that. I'd love to do string processing in UTF-8 without potentially doubling the storage requirements by first converting it to UTF-16, then doing the processing, then converting it back.
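
As a small illustration of processing UTF-8 in place, here is a sketch that counts code points directly on the UTF-8 bytes, without a round trip through UTF-16. It uses std::string_view as a stand-in for a UTF-8 view type (no char8_t, so it also builds as C++17), the function name countCodePoints is made up for the example, and it assumes the input is valid UTF-8:

```cpp
#include <cstddef>
#include <string_view>

// Count code points in a (valid) UTF-8 sequence by counting every byte
// that is not a continuation byte. Continuation bytes have the bit
// pattern 10xxxxxx; every other byte starts a new code point.
std::size_t countCodePoints(std::string_view utf8) noexcept
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

No temporary UTF-16 buffer is needed, so the storage stays at whatever the UTF-8 input already occupies.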

Qt should have a strong story not just for UTF-16, but also for UTF-8.

I've talked about this at QtWS, but here's the TL;DV of it:

value_type            | container     | view              | string-ish API?
----------------------+---------------+-------------------+----------------
char / QLatin1Char    | QLatin1String | QLatin1StringView | yes
char8_t / qchar8      | QUtf8String   | QUtf8StringView   | yes
char16_t / QChar      | QString       | QStringView       | yes
(char32_t             | QUtf32String  | QUtf32StringView  | yes)

std::byte             | QByteArray    | QByteArrayView    | NO

I'm not sure we need the UTF-32 one, and I'm OK with dropping the L1 one, provided a) we can depend on char8_t (i.e. Qt 7) and b) UTF-8 <-> UTF-16 operations are not much slower than L1 <-> UTF-16 ones (I heard Lars' team has them down to within 5% of each other; not sure that's possible). Anyway, we'd have two class templates, and they'd just be instantiated with different Char types to flesh out all of the above, with the exception of the byte array ones:

  using QUtf8String = QBasicString<char8_t>;
  using QString = QBasicString<char16_t>;
  using QLatin1String = QBasicString<char>;
  (using QByteArray = QVector<std::byte>;)
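
A rough sketch of how one class template could back several string classes, using only the standard library. BasicStringSketch and the alias names are invented for this example and are not a proposal for the actual QBasicString interface. The char8_t alias is guarded so the snippet also compiles before C++20:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for QBasicString<Char>: one template, string-ish API.
template <typename Char>
class BasicStringSketch
{
public:
    using value_type = Char;

    BasicStringSketch() = default;
    BasicStringSketch(const Char *s) : m_data(s) {}

    std::size_t size() const noexcept { return m_data.size(); }

    // Example of string-ish API that every instantiation gets for free.
    bool startsWith(const BasicStringSketch &prefix) const
    {
        return m_data.compare(0, prefix.size(), prefix.m_data) == 0;
    }

private:
    std::basic_string<Char> m_data;
};

using Latin1StringSketch = BasicStringSketch<char>;
using Utf16StringSketch = BasicStringSketch<char16_t>;
#ifdef __cpp_char8_t
using Utf8StringSketch = BasicStringSketch<char8_t>; // needs C++20
#endif

// The byte container deliberately gets no string-ish API at all.
using ByteArraySketch = std::vector<std::byte>;
```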

If, after getting all of the above running, we _then_ want The One String (View) To Rule Them All, then I'd suggest QAnyString{,View} (not sure we need a QAnyString), which can contain any of the 2-4 string (view) classes above (but not QByteArray(View)), but which doesn't have string-ish API. Instead, you need to inspect it to extract the actual string class (QLatin1String, QUtf8String, QString) contained, or simply ask for the one you want, and it will convert, if necessary.

With this, your typical Qt function taking strings would look like this:

   void QLineEdit::setText(QAnyStringView text)
   {
       Q_D(QLineEdit);
       if (text == d->text) // mixed-mode comparisons are supported out of the box
           return;
       // centralized conversion to QString (in library, not user code);
       // also available: toLatin1(), toUtf8()
       d->text = text.toString();
       update();
   }

Callers now have total freedom in what to pass:

   le->setText("Hi");
   le->setText(u"Hi");
   le->setText(u8"Hi");
   le->setText(u"Hi"s);
   le->setText(u8"Hi"sv);
   le->setText(QVarLengthArray{'H', 'i'});
   le->setText("Hello" % ", World"); // QStringBuilder

and they'd all result in optimal code, because QAnyStringView is a trivial type (in the C++ sense), which means, unlike QString, it can be passed in CPU registers instead of on the stack.
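
To make the "cheap to pass" point concrete, here is a sketch of one possible layout: a pointer plus a size whose lowest bit stores the encoding. The class name AnyStringViewSketch and this particular packing are assumptions for illustration only; whether such a type really travels in registers depends on the platform ABI, and the last static_assert assumes that size_t and pointers have the same size:

```cpp
#include <cstddef>
#include <string_view>
#include <type_traits>

// Hypothetical stand-in for the proposed QAnyStringView; not Qt API.
class AnyStringViewSketch
{
public:
    enum class Encoding : std::size_t { Latin1 = 0, Utf16 = 1 };

    // The size is shifted left by one so the lowest bit can hold the tag
    // (this sacrifices the top bit of the size, which is fine for a sketch).
    AnyStringViewSketch(std::string_view s) noexcept
        : m_data(s.data()),
          m_sizeAndTag((s.size() << 1) | std::size_t(Encoding::Latin1)) {}
    AnyStringViewSketch(std::u16string_view s) noexcept
        : m_data(s.data()),
          m_sizeAndTag((s.size() << 1) | std::size_t(Encoding::Utf16)) {}

    const void *data() const noexcept { return m_data; }
    std::size_t size() const noexcept { return m_sizeAndTag >> 1; }
    Encoding encoding() const noexcept { return Encoding(m_sizeAndTag & 1); }

private:
    const void *m_data;
    std::size_t m_sizeAndTag;
};

// Two words, trivially copyable and destructible: a candidate for being
// passed in registers on common ABIs.
static_assert(std::is_trivially_copyable_v<AnyStringViewSketch>);
static_assert(std::is_trivially_destructible_v<AnyStringViewSketch>);
static_assert(sizeof(AnyStringViewSketch) == 2 * sizeof(void *));
```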

Likewise, parsing code could do

   Meep parseMeep(QAnyStringView str)
   {
       return str.visit([](auto str) {
           Meep meep;
           for (auto me : str.tokenize(u'\n'))
               meep += parse(me);
           return meep;
       });
   }

iow: instead of a bunch of overloads, you write your code as a template and let QAnyStringView instantiate your lambda with the actual type of string view passed.
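
Here is a self-contained approximation of that pattern with std::variant standing in for the view. AnyStringViewSketch and countLines are invented names, only two encodings are covered, and a real QAnyStringView would presumably not be implemented this way. The point is just that the generic lambda is written once and instantiated once per view type:

```cpp
#include <cstddef>
#include <string_view>
#include <utility>
#include <variant>

// Hypothetical stand-in for QAnyStringView, built on std::variant.
class AnyStringViewSketch
{
public:
    AnyStringViewSketch(std::string_view s) noexcept : m_view(s) {}
    AnyStringViewSketch(std::u16string_view s) noexcept : m_view(s) {}

    // Calls the visitor with the concrete view type that is stored.
    template <typename Visitor>
    decltype(auto) visit(Visitor &&visitor) const
    {
        return std::visit(std::forward<Visitor>(visitor), m_view);
    }

private:
    std::variant<std::string_view, std::u16string_view> m_view;
};

// One function body, no overloads: counts newline-separated pieces.
std::size_t countLines(AnyStringViewSketch str)
{
    return str.visit([](auto s) -> std::size_t {
        using Char = typename decltype(s)::value_type;
        if (s.empty())
            return 0;
        std::size_t lines = 1;
        for (Char c : s) {
            if (c == Char('\n'))
                ++lines;
        }
        return lines;
    });
}
```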

As a further example, here's op== for QAnyStringView (provided by Qt):

   bool operator==(QAnyStringView lhs, QAnyStringView rhs) noexcept
   {
       return lhs.visit([rhs](auto lhs) {
           return rhs.visit([lhs](auto rhs) {
               return lhs == rhs;
           });
       });
   }
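
The double dispatch can be tried out with standard types. This sketch restricts itself to Latin-1 and UTF-16, because for those two a code-unit-wise comparison after widening is exact; comparing against UTF-8 would need decoding first and is left out. All names (AnyStringViewSketch, widen, equals) are made up for the example:

```cpp
#include <algorithm>
#include <string_view>
#include <utility>
#include <variant>

// Hypothetical stand-in for QAnyStringView (Latin-1 and UTF-16 only).
class AnyStringViewSketch
{
public:
    AnyStringViewSketch(std::string_view s) noexcept : m_view(s) {}
    AnyStringViewSketch(std::u16string_view s) noexcept : m_view(s) {}

    template <typename Visitor>
    decltype(auto) visit(Visitor &&visitor) const
    {
        return std::visit(std::forward<Visitor>(visitor), m_view);
    }

private:
    std::variant<std::string_view, std::u16string_view> m_view;
};

// Widen a code unit to a common type. Latin-1 maps 1:1 onto U+0000..U+00FF,
// so going through unsigned char avoids sign-extension surprises.
constexpr char32_t widen(char c) noexcept { return static_cast<unsigned char>(c); }
constexpr char32_t widen(char16_t c) noexcept { return c; }

// Double dispatch: the outer visit fixes the type of lhs, the inner that of rhs.
bool equals(AnyStringViewSketch lhs, AnyStringViewSketch rhs)
{
    return lhs.visit([rhs](auto l) {
        return rhs.visit([l](auto r) {
            return std::equal(l.begin(), l.end(), r.begin(), r.end(),
                              [](auto a, auto b) { return widen(a) == widen(b); });
        });
    });
}
```

With two alternatives the compiler instantiates the innermost comparison four times, once per combination of view types.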

Last year, I heard someone (don't remember who) suggest this for QString. That is: allow QString to hold UTF-16 or UTF-8 data. I'd classify this idea as another over-my-dead-body (which, btw, is semi-official ISO speak for "strong objection"). As I'm wont to say: An API doesn't become easy to use by minimizing the number of classes, but by minimizing the number of responsibilities per class, even if that means many small classes rather than one big one.

I would add, as I've done before, and even Matthew said, that I'd be very wary of folding QStringView into QString. I can understand the urge to not have to go and s/QString/QStringView/ in many places (or s/QString/QAnyStringView/), but it is my firm belief that it would make Qt much easier and more convenient to use if we didn't put all those responsibilities on QString.

There's only our own laziness that stands in the way of this better alternative.

Thanks,
Marc
_______________________________________________
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
