Re: [Development] char8_t summary?
On Tuesday, 16 July 2019 09:11:37 PDT Matthew Woehlke wrote:
> On 15/07/2019 18.19, Thiago Macieira wrote:
> > On Monday, 15 July 2019 09:41:24 PDT Matthew Woehlke wrote:
> >> Note also that I suggested having the template definition out-of-line;
> >> it doesn't need to be in (e.g.) qstring.h or anywhere that will affect
> >> *user* compile times. Only the TU responsible for instantiating them
> >> would be affected, and that should be negligible in the grand scheme of
> >> things.
> >
> > Then it's no different than an overload, if the implementation isn't the
> > same (and it isn't).
>
> ...but a template allows the common portions to be written in a single
> definition with overloads *and/or* `if constexpr` used where the code
> needs to differ. Regular overloads would require 100% of the definition
> to be duplicated for each overload.

And what Marc and I are arguing is that the common portions are small enough
not to be worth the hassle of a template in the first place.

> Concrete example:
>
>   // .h
>   bool contains(QStringView);
>   bool contains(QLatin1StringView);

[cut]

Two things:
1) templatisation of contains, indexOf, startsWith, etc. is already being
   done in dev
2) the work being done *and* your example are UTF-16 and Latin1 only. The
   whole issue here is that *UTF-8* will not share enough code.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development
Re: [Development] char8_t summary?
On 2019-07-16 18:11, Matthew Woehlke wrote:
[...]
> The basic algorithm (iterate through 'haystack' looking for 'needle') is
> common regardless of the string types. The points that differ (e.g. only
> starting the search at code points, computing lengths) use overloaded
> helper functions which can be inline (e.g. q_next_codepoint for some
> types will just be operator++) and optimized.

Please square that with this comment from qstring.cpp:

    // we're going to read a[0..15] and b[0..15] (32 bytes)

Thanks,
Marc
Re: [Development] char8_t summary?
On 15/07/2019 18.19, Thiago Macieira wrote:
> On Monday, 15 July 2019 09:41:24 PDT Matthew Woehlke wrote:
>> Note also that I suggested having the template definition out-of-line;
>> it doesn't need to be in (e.g.) qstring.h or anywhere that will affect
>> *user* compile times. Only the TU responsible for instantiating them
>> would be affected, and that should be negligible in the grand scheme of
>> things.
>
> Then it's no different than an overload, if the implementation isn't the same
> (and it isn't).

...but a template allows the common portions to be written in a single
definition with overloads *and/or* `if constexpr` used where the code
needs to differ. Regular overloads would require 100% of the definition
to be duplicated for each overload.

In terms of *declarations*, yes, you are going to have the same number of
declarations. However, those are only one line each, and can potentially be
generated for each string type using a macro, so O(M+N) (M = methods, N =
string types) rather than O(M*N) actual source lines. (Granted, you could do
this for plain overload declarations also, but a) this probably doesn't play
as well with documentation, and b) you still have to write O(M*N) definitions
rather than O(M).)

Concrete example:

  // .h
  bool contains(QStringView);
  bool contains(QLatin1StringView);

  // .cpp
  bool contains(QStringView needle) { ... }
  bool contains(QLatin1StringView needle) { ... }

- vs -

  // .h
  template <typename T> bool contains(T);
  extern template bool contains(QStringView);
  extern template bool contains(QLatin1StringView);

  // .cpp
  template <typename T> bool contains(T needle)
  {
    int const l = needle.chars();
    int i = 0;
    ... // computation of went_too_far elided
    while (i < went_too_far)
    {
      if (q_compare_strings(this->midRef(i), needle, l))
        return true;
      i = q_next_codepoint(this, i);
    }
    return false;
  }
  template bool contains(QStringView);
  template bool contains(QLatin1StringView);

Keep in mind also that this method lives in a notional (templated?)
QGenericString base class and/or is actually a helper function, i.e. it is
also templated on the type of this/haystack... thus I have this *one and
only one* definition of 'contains', rather than O(N²) definitions.

Hopefully this presents a plausible example of common code. The basic
algorithm (iterate through 'haystack' looking for 'needle') is common
regardless of the string types. The points that differ (e.g. only starting
the search at code points, computing lengths) use overloaded helper
functions which can be inline (e.g. q_next_codepoint for some types will
just be operator++) and optimized. It's also likely that these helpers will
be used in multiple methods.

-- 
Matthew
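A minimal, self-contained sketch of the "one definition, explicit instantiations" layout described above. It uses std::u16string and std::string as stand-ins for QStringView and QLatin1StringView, and the helper name toUtf16Unit is hypothetical; none of this is Qt API. It only covers needle types whose code units map 1:1 onto UTF-16 code units, which is exactly the case the thread agrees is easy.

```cpp
#include <cstddef>
#include <string>

// Per-encoding helpers: widen one needle code unit to a UTF-16 code unit.
// Declared before the template so that ordinary lookup finds them.
constexpr char16_t toUtf16Unit(char16_t c) { return c; }
constexpr char16_t toUtf16Unit(char c) { return static_cast<unsigned char>(c); } // Latin-1 byte

// "Header" part: a declaration plus explicit instantiation declarations.
template <typename Needle>
bool contains(const std::u16string &haystack, const Needle &needle);
extern template bool contains<std::u16string>(const std::u16string &, const std::u16string &);
extern template bool contains<std::string>(const std::u16string &, const std::string &);

// "Source" part: the single definition of the common algorithm.
template <typename Needle>
bool contains(const std::u16string &haystack, const Needle &needle)
{
    // This length check is only valid because both supported needle types
    // map one code unit to exactly one UTF-16 code unit.
    if (needle.size() > haystack.size())
        return false;
    for (std::size_t i = 0; i + needle.size() <= haystack.size(); ++i) {
        std::size_t j = 0;
        while (j < needle.size() && haystack[i + j] == toUtf16Unit(needle[j]))
            ++j;
        if (j == needle.size())
            return true;
    }
    return false;
}

// Explicit instantiation definitions: the only place the body is compiled.
template bool contains<std::u16string>(const std::u16string &, const std::u16string &);
template bool contains<std::string>(const std::u16string &, const std::string &);
```

In a real layout the first half would live in a header and the second half in one .cpp file, so user translation units never see the template body. A UTF-8 needle would not fit this sketch as written, because the length check and the per-unit comparison both rely on the 1:1 mapping.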
Re: [Development] char8_t summary?
On Monday, 15 July 2019 09:41:24 PDT Matthew Woehlke wrote:
> Note also that I suggested having the template definition out-of-line;
> it doesn't need to be in (e.g.) qstring.h or anywhere that will affect
> *user* compile times. Only the TU responsible for instantiating them
> would be affected, and that should be negligible in the grand scheme of
> things.

Then it's no different than an overload, if the implementation isn't the same
(and it isn't).

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] char8_t summary?
On 14/07/2019 02.28, Mutz, Marc via Development wrote:
> If you're still not convinced, here's QStringView::endsWith() as a
> template:
>
>   template <typename Prefix>
>     requires std::is_convertible_v<Prefix, QUtf8StringView> || ... QLatin1StringView ...
>   Q_ALWAYS_INLINE
>   bool endsWith(Prefix p) const {
>       return QtPrivate::endsWith(*this,
>                  QtPrivate::qStringLikeToStringView(p));
>   }
>
> with a qStringLikeToStringView() similar to the one in 181620. This uses
> C++20, and I'm sure it loses something over the current implementation.
> Qt::CaseSensitivity comes to mind.

...and I don't know why you didn't just propagate through the case
sensitivity argument?

> To anyone speaking up in favour of
> the box: Please write this in C++11 before you hit reply :)

IIUC, replacing the `requires` is trivial. A bit ugly, sure, but not
difficult.

I also question the value of the indirection in the above. Moving the
implementation of QtPrivate::endsWith to be inline, and making use of
`if constexpr` where useful, will hopefully reduce the total amount of
code. (Yes, eventually you're going to have an optimized string
comparison. Helper code like that to implement the critical code paths
will still exist, but hopefully those are bits that get used over and
over in many methods.)

Note also that I suggested having the template definition out-of-line;
it doesn't need to be in (e.g.) qstring.h or anywhere that will affect
*user* compile times. Only the TU responsible for instantiating them
would be affected, and that should be negligible in the grand scheme of
things.

BTW, I don't think ternary functions are an issue. The ones that come to
mind will "always"¹ need to convert one of their arguments anyway, so
while the *templates* may involve another level of combinatorics, that
level won't affect the implementation complexity in any meaningful way.

(¹ Possibly they can skip this because that argument is never actually
used, but otherwise it must be converted.)

-- 
Matthew
Re: [Development] char8_t summary?
On 13/07/2019 21:39, Volker Hilsheimer wrote:
> With an (ideally) single template-based API we don’t have people using Qt
> get lost in the jungle of overloads and string classes. For the
> implementation, we can specialise the templates to call the suitable
> internal functions that implement the various algorithms.

This is basically a Qt 7 idea, raised some time ago: a string class that
is a collection of code points under "some" Unicode encoding,
transparently wrapping UTF-8 / 16 / 32 sequences without extra copies.
Functions between these strings dispatch to the right overload, in a
manner that is totally invisible to the user.

Similarly, the high level API works in terms of code points, not code
units. Only if one wants to get one's hands dirty can one query and
extract the actual encoded data.

My 2 c,
-- 
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts
Re: [Development] char8_t summary?
On 13/07/2019 15.39, Volker Hilsheimer wrote:
> As I understood the template suggestion, it’s more about not having
> to add 64 different overloads (or several more string classes) to
> the Qt API, and less about unifying all implementations into a single
> set of algorithms.

Right. At some point you are going to call out to specialized functions
(e.g. qt_compare_strings as Marc mentioned). The thought was to have a
(more modest) set of these specialized helpers with the generic bits
implemented as template logic. Probably with a bunch of `if constexpr`
branches to perform optimizations when possible.

> On 13/07/2019 07.41, Thiago Macieira wrote:
>> Again, note how the template implicitly assumes things. A 3-character
>> string cannot be present at the beginning (startsWith), end (endsWith)
>> or anywhere in the middle (contains, indexOf, lastIndexOf) of a
>> 2-character one, for example.
>>
>> But a 2- and 3-byte UTF-8 string can be the prefix of a 1-character
>> UTF-16 string and a 4-byte UTF-8 string can be the prefix of a
>> 2-codeunit UTF-16 (1 character).

The correct fix for that is to count code points, not characters.
Possibly this means that such an optimization should be behind an
`if constexpr`, so that it is only used when it is safe to do so.

-- 
Matthew
Re: [Development] char8_t summary?
On Sun, Jul 14, 2019 at 08:28:58AM +0200, Mutz, Marc via Development wrote:
> > As I understood the template suggestion, it’s more about not having to
> > add 64 different overloads (or several more string classes) to the Qt
> > API, and less about unifying all implementations into a single set of
> > algorithms.
>
> [I'm replying to Volker, but this should be read as replying to everyone,
> and 'you' should be read as the plural form]

Thanks for this clarification. It really helps.

> [...]
> But that doesn't reduce the number of overloads.

Has a thin wrapper around the "usual suspects" of string-like arguments
been considered as the normal way of passing arguments?

This could be something like a string view with a few bits spent on
encoding information.

This effectively eats the "free" implicit type conversion when passing an
argument, but history (the introduction of QStringBuilder) has shown that
while not completely source compatible, it was fairly harmless.

QStringBuilder itself already uses up the free conversion, but could get
an operator xxx() to produce the new argument type, and could even keep
track of the encodings of the parts to help provide the right encoding
bits.

>   template <typename Prefix>
>     requires std::is_convertible_v<Prefix, QUtf8StringView> || ... QLatin1StringView ...
>   Q_ALWAYS_INLINE
>   bool endsWith(Prefix p) const {
>       return QtPrivate::endsWith(*this,
>                  QtPrivate::qStringLikeToStringView(p));
>   }
>
> with a qStringLikeToStringView() similar to the one in 181620.

That looks kind of related, except that the qStringLikeToStringView() call
should not need to be written out explicitly multiple times on the
receiver side, but should be done implicitly in the conversion of the
arguments in the function call.

Andre'
Re: [Development] char8_t summary?
On 2019-07-13 21:39, Volker Hilsheimer wrote:
>> On 13 Jul 2019, at 13:41, Thiago Macieira wrote:
>> On Friday, 12 July 2019 17:37:59 -03 Matthew Woehlke wrote:
>>> That said, I took a look at startsWith, and... surprise! It is *already
>>> a template*. So at least in that case, it isn't obvious why adding more
>>> combinations would be so terribly onerous.
>>
>> Again, note how the template implicitly assumes things. A 3-character
>> string cannot be present at the beginning (startsWith), end (endsWith)
>> or anywhere in the middle (contains, indexOf, lastIndexOf) of a
>> 2-character one, for example.
>>
>> But a 2- and 3-byte UTF-8 string can be the prefix of a 1-character
>> UTF-16 string and a 4-byte UTF-8 string can be the prefix of a
>> 2-codeunit UTF-16 (1 character). That means implementing UTF-8
>> functions requires different algorithms in the first place. That means
>> templates are not usually the answer.
>>
>> I'm not saying impossible. You can, by writing sufficiently generic
>> algorithms that scan the strings in lockstep (you can scan UTF-8
>> backwards, after all). But the reason you don't *want* to is that our
>> Latin1 and UTF-16 algorithms are optimised, often vectorised, for their
>> purpose. We don't want to lose the efficiency we've already got.
>>
>> And I'm not saying we shouldn't have UTF-8 algorithms or even a
>> QUtf8StringView or some such. It would have helped in CBOR, for
>> example, see QCborStreamWriter:
>>    void appendTextString(const char *utf8, qsizetype len);
>>
>> This is one that should at least get the overload.
>>
>> --
>> Thiago Macieira - thiago.macieira (AT) intel.com
>>   Software Architect - Intel System Software Products
>
> As I understood the template suggestion, it’s more about not having to
> add 64 different overloads (or several more string classes) to the Qt
> API, and less about unifying all implementations into a single set of
> algorithms.

[I'm replying to Volker, but this should be read as replying to everyone,
and 'you' should be read as the plural form]

There's a bonus for documentability, of course, by using templates: one
template vs. 64 explicit overloads. I hasten to add that the 64 is
counting *this, so we're back to 16 for documentation purposes, because
no-one is proposing to remove the member functions and only provide the
free functions that back them, and that it's harder to document what a
template accepts than it is to document 16 overloads, now that we can
have multiple \fn per qdoc comment block.

But that doesn't reduce the number of overloads. That template will be
instantiated 16 times (and more, as it's hard to ignore const/non-const
without forcing a copy, and even with a copy, the template function
doesn't do implicit conversions the way an ordinary function would).
Those instantiations are functions. Inline ones, hopefully, but
nonetheless functions. It will not help compile-times, and it will
degrade the error messages from the compiler, even if we (as we should)
constrain the template.

As an example of what all of this means, look at
https://codereview.qt-project.org/c/qt/qtbase/+/181620, which is doing
exactly that: making a former non-template a template function. Not even
Thiago is sure it won't break code, and while I'd like to stand in front
of you and claim that I designed it so that there _is_ no difference, in
practice I wouldn't bet that some obscure compiler (like MSVC or the
Integrity one) won't throw logs^Wtrunks in my way by the time I hit
submit.

Or look at QStringView ctors. It's a bit harder than it needs to be,
because QStringView can't depend on QString in-size (because QString
does on QStringView), but you're basically asking to make every string
class member function that takes another string a mixture of
QString::arg() as proposed in 181620 and current QStringView
construction.

Besides, as we all know, you can't partially-specialise function
templates, so if you write 'specialise' what you're saying is either
'overload' or 'add a template struct with static members, partially
specialise the struct' (iow: overloads).

I hope this convinces everyone to finally close the lid on the box
labelled 'use templates and everything will be oh so easy'. Will we
(have to) use templates? Yes. Will it reduce the number of overloads?
Only if you want to inflict pain on your users.

If you're still not convinced, here's QStringView::endsWith() as a
template:

  template <typename Prefix>
    requires std::is_convertible_v<Prefix, QUtf8StringView> || ... QLatin1StringView ...
  Q_ALWAYS_INLINE
  bool endsWith(Prefix p) const {
      return QtPrivate::endsWith(*this,
                 QtPrivate::qStringLikeToStringView(p));
  }

with a qStringLikeToStringView() similar to the one in 181620. This uses
C++20, and I'm sure it loses something over the current implementation.
Qt::CaseSensitivity comes to mind. To anyone speaking up in favour of the
box: Please write this in C++11 before you hit reply :)

Thanks,
Marc
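For readers weighing the "write this in C++11" challenge, here is one possible shape of a constrained endsWith without `requires`, using std::enable_if. This is only a sketch under stated assumptions: Latin1View, Utf16View, toView and endsWithImpl are hypothetical stand-ins, not Qt types or functions, and UTF-8 and Qt::CaseSensitivity are left out entirely. It does not settle the argument either way; note that the encoding-specific back ends are still plain overloads.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical stand-ins for QLatin1StringView / QStringView.
struct Latin1View {
    const char *data;
    std::size_t size;
    Latin1View(const std::string &s) : data(s.data()), size(s.size()) {}
};
struct Utf16View {
    const char16_t *data;
    std::size_t size;
    Utf16View(const std::u16string &s) : data(s.data()), size(s.size()) {}
};

// Encoding-specific back ends, selected by ordinary overload resolution.
inline bool endsWithImpl(Utf16View h, Utf16View n)
{
    if (n.size > h.size)
        return false;
    const char16_t *tail = h.data + (h.size - n.size);
    for (std::size_t i = 0; i < n.size; ++i)
        if (tail[i] != n.data[i])
            return false;
    return true;
}
inline bool endsWithImpl(Utf16View h, Latin1View n)
{
    if (n.size > h.size)
        return false;
    const char16_t *tail = h.data + (h.size - n.size);
    for (std::size_t i = 0; i < n.size; ++i)
        if (tail[i] != static_cast<unsigned char>(n.data[i])) // widen Latin-1 byte
            return false;
    return true;
}

// Maps a string-like argument to the matching view type.
inline Utf16View toView(const std::u16string &s) { return Utf16View(s); }
inline Latin1View toView(const std::string &s) { return Latin1View(s); }

// C++11 replacement for the `requires` clause: the template only takes part
// in overload resolution for arguments convertible to one of the views.
template <typename Needle,
          typename std::enable_if<
              std::is_convertible<const Needle &, Utf16View>::value ||
              std::is_convertible<const Needle &, Latin1View>::value,
              int>::type = 0>
bool endsWith(const std::u16string &haystack, const Needle &needle)
{
    return endsWithImpl(Utf16View(haystack), toView(needle));
}
```

The sketch shows the point made above about implicit conversions: a string literal argument is rejected here, because it would need two user-defined conversions to reach a view, whereas a non-template overload taking Latin1View would face the same limit only if it, too, relied on std::string as the intermediate.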
Re: [Development] char8_t summary?
> On 13 Jul 2019, at 13:41, Thiago Macieira wrote:
> On Friday, 12 July 2019 17:37:59 -03 Matthew Woehlke wrote:
>> That said, I took a look at startsWith, and... surprise! It is *already
>> a template*. So at least in that case, it isn't obvious why adding more
>> combinations would be so terribly onerous.
>
> Again, note how the template implicitly assumes things. A 3-character
> string cannot be present at the beginning (startsWith), end (endsWith)
> or anywhere in the middle (contains, indexOf, lastIndexOf) of a
> 2-character one, for example.
>
> But a 2- and 3-byte UTF-8 string can be the prefix of a 1-character
> UTF-16 string and a 4-byte UTF-8 string can be the prefix of a
> 2-codeunit UTF-16 (1 character). That means implementing UTF-8
> functions requires different algorithms in the first place. That means
> templates are not usually the answer.
>
> I'm not saying impossible. You can, by writing sufficiently generic
> algorithms that scan the strings in lockstep (you can scan UTF-8
> backwards, after all). But the reason you don't *want* to is that our
> Latin1 and UTF-16 algorithms are optimised, often vectorised, for their
> purpose. We don't want to lose the efficiency we've already got.
>
> And I'm not saying we shouldn't have UTF-8 algorithms or even a
> QUtf8StringView or some such. It would have helped in CBOR, for example,
> see QCborStreamWriter:
>    void appendTextString(const char *utf8, qsizetype len);
>
> This is one that should at least get the overload.
>
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
>   Software Architect - Intel System Software Products

As I understood the template suggestion, it’s more about not having to add
64 different overloads (or several more string classes) to the Qt API, and
less about unifying all implementations into a single set of algorithms.

With an (ideally) single template-based API we don’t have people using Qt
get lost in the jungle of overloads and string classes. For the
implementation, we can specialise the templates to call the suitable
internal functions that implement the various algorithms.

I don’t know or claim that this is feasible, but that’s how I have
interpreted the suggestion for a template-based solution, and generally the
(valid, IMHO) complaint that we have by now a ton of classes in Qt that
solve almost the same problem, and require a significant cognitive effort
to choose correctly from.

Cheers,
Volker
Re: [Development] char8_t summary?
On Friday, 12 July 2019 17:37:59 -03 Matthew Woehlke wrote:
> That said, I took a look at startsWith, and... surprise! It is *already
> a template*. So at least in that case, it isn't obvious why adding more
> combinations would be so terribly onerous.

Again, note how the template implicitly assumes things. A 3-character string
cannot be present at the beginning (startsWith), end (endsWith) or anywhere in
the middle (contains, indexOf, lastIndexOf) of a 2-character one, for example.

But a 2- and 3-byte UTF-8 string can be the prefix of a 1-character UTF-16
string and a 4-byte UTF-8 string can be the prefix of a 2-codeunit UTF-16 (1
character). That means implementing UTF-8 functions requires different
algorithms in the first place. That means templates are not usually the
answer.

I'm not saying impossible. You can, by writing sufficiently generic algorithms
that scan the strings in lockstep (you can scan UTF-8 backwards, after all).
But the reason you don't *want* to is that our Latin1 and UTF-16 algorithms
are optimised, often vectorised, for their purpose. We don't want to lose the
efficiency we've already got.

And I'm not saying we shouldn't have UTF-8 algorithms or even a
QUtf8StringView or some such. It would have helped in CBOR, for example, see
QCborStreamWriter:
    void appendTextString(const char *utf8, qsizetype len);

This is one that should at least get the overload.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
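To make the "scan the strings in lockstep" option concrete, here is a minimal sketch of startsWith for a UTF-16 haystack and a UTF-8 needle that decodes one code point from each side per step. It is an illustration only, with assumptions: the function names are hypothetical, the UTF-8 input is assumed well-formed, and std::string / std::u16string stand in for the Qt views. It is the slow, generic approach, with none of the vectorisation mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes one code point from well-formed UTF-8 starting at index i,
// advancing i past the sequence. No validation is performed.
char32_t nextCodePoint(const std::string &s, std::size_t &i)
{
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80)
        return lead;                        // single-byte (ASCII) sequence
    int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);   // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Decodes one code point from UTF-16, combining a surrogate pair if present.
char32_t nextCodePoint(const std::u16string &s, std::size_t &i)
{
    assert(i < s.size());
    const char16_t hi = s[i++];
    if (hi >= 0xD800 && hi <= 0xDBFF && i < s.size()) {
        const char16_t lo = s[i];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++i;
            return 0x10000 + ((char32_t(hi) - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return hi;                              // BMP code point or lone surrogate
}

// Compares code point by code point. Note that no length-based early-out on
// code units is possible: a 3-byte needle can match a 1-unit haystack.
bool startsWith(const std::u16string &haystack, const std::string &utf8Needle)
{
    std::size_t h = 0, n = 0;
    while (n < utf8Needle.size()) {
        if (h >= haystack.size())
            return false;
        if (nextCodePoint(haystack, h) != nextCodePoint(utf8Needle, n))
            return false;
    }
    return true;
}
```

The two nextCodePoint overloads play the role of the overloaded helpers discussed elsewhere in the thread; the cost is that every comparison goes through a decode step on both sides.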
Re: [Development] char8_t summary?
On Friday, 12 July 2019 12:27:58 -03 Matthew Woehlke wrote:
> > And if we want to make use of the fact that a string
> > is UTF-8, the templates won't work.
>
> Eh? char8_t is a detectable and distinct type. (Wasn't that the whole
> point of this thread?) So is QUtf8String if such a thing were to come
> into existence.

I didn't mean we can't write templates. I meant that at the end of the
implementation, you've got two distinct functions: one for Latin1/US-ASCII*
and one for UTF-8, whether you used templates or not. So the template didn't
buy you much.

[*] US-ASCII under "out of range characters are UB", which allows us to simply
use Latin1. Or UTF-8.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] char8_t summary?
On 2019-07-12 22:37, Matthew Woehlke wrote:
[...]
> So, perhaps you should suggest a more specific example?

I did: replace and relational operators.

And you're right to look at startsWith(), because that is indeed binary
(*this being the first argument). And it's also one which is thoroughly
view-enabled.

But this just means that my replace() math was wrong: it's not binary,
it's ternary (*this, before, after) and that means not 16 vs. 25
overloads, but 64 vs. 125 overloads. And that _is_ with views enabled,
as per QtPrivate::startsWith() (QChar arguments are handled one level
up, and converted to a QStringView argument).

And speaking about startsWith(): if you drill down through the
templates, you will end up in qt_compare_strings, which is not templated,
and even if it could be today, which would be rather pointless, you just
drill one more level down and end up in ucstrncmp etc, which are oh so
far away from ever being templates...

So, as you can see, we're already using templates where it makes sense,
but at some point you do need to go into the gritty details, and then
it's assembler, not templates.

Thanks,
Marc
Re: [Development] char8_t summary?
On 12/07/2019 16.05, Mutz, Marc via Development wrote:
> On 2019-07-12 17:27, Matthew Woehlke wrote:
>> On 11/07/2019 15.01, Thiago Macieira wrote:
> [...]
>>> Except that the whole point of those methods is that they can be more
>>> efficient when the encoding is known and therefore templating won't
>>> help.
>>
>> So those cases can employ specializations. Or, perhaps better, wrap the
>> implementation bits where it matters in `if constexpr`.
>
> You should, maybe, take a look at qstring.cpp before you make such
> uninformed statements.

I was thinking in terms of what I would do if I was implementing things
from scratch; not how I would refactor existing code.

That said, I took a look at startsWith, and... surprise! It is *already
a template*. So at least in that case, it isn't obvious why adding more
combinations would be so terribly onerous.

For that matter, making it a template (with explicit extern
instantiations) would already be an improvement since it would cut down
the several extant definitions into one definition and some declarations
(which could even be enumerated by macro magic).

So, perhaps you should suggest a more specific example?

-- 
Matthew
Re: [Development] char8_t summary?
On 2019-07-12 17:27, Matthew Woehlke wrote:
> On 11/07/2019 15.01, Thiago Macieira wrote:
[...]
>> Except that the whole point of those methods is that they can be more
>> efficient when the encoding is known and therefore templating won't
>> help.
>
> So those cases can employ specializations. Or, perhaps better, wrap the
> implementation bits where it matters in `if constexpr`.

You should, maybe, take a look at qstring.cpp before you make such
uninformed statements. When you do, keep in mind that these 12k5loc do
not even contain direct (as in zerocopy) utf-8/l1 and utf-8/utf16
comparisons, yet. Optimizing those is what earns you a slot at CppCon.
Well, not anymore, that ship has sailed.

Thanks,
Marc
Re: [Development] char8_t summary?
On 11/07/2019 15.01, Thiago Macieira wrote:
> On Thursday, 11 July 2019 13:41:49 -03 Matthew Woehlke wrote:
>> On 11/07/2019 05.05, Mutz, Marc via Development wrote:
>>> There is a cost associated with another string class, too, and it's
>>> combinatorial explosion. Even when we have all view types
>>> (QLatin1StringView, QUtf8StringView, QStringView), consider the overload
>>> set of QString::replace(), ignoring the (ptr, size) variants:
>>>
>>>    {QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}
>>>
>>> that's 16 overloads. And that's without a possible QUtf32StringView.
>>
>> So?
>>
>> The right way to handle this is for those methods to be templated, in
>> which case a) the code only needs to be written O(1) times, not O(N)
>> times, and b) users can potentially specialize for their own string
>> types as well.
>
> Except that the whole point of those methods is that they can be more
> efficient when the encoding is known and therefore templating won't help.

So those cases can employ specializations. Or, perhaps better, wrap the
implementation bits where it matters in `if constexpr`.

> Templating won't make overload resolution any faster, but will make
> compilation times slower.

For Qt, yes. This could be significantly (entirely?) mitigated with
explicit, external instantiations, such that only the one source in Qt
itself that compiles the instantiations is significantly affected.

> And if we want to make use of the fact that a string
> is UTF-8, the templates won't work.

Eh? char8_t is a detectable and distinct type. (Wasn't that the whole
point of this thread?) So is QUtf8String if such a thing were to come
into existence.

>> If done cleverly, even the (pointer, size) variants should be able to
>> wrap the arguments in a View, such that those method definitions are
>> trivial.
>
> View = (pointer,size) pair.

I meant that e.g. it would not be hard to make:

  foo(CharType const* s, SizeType L)

...be a simple wrapper around:

  foo(View::type s);

...which is itself either a template (per above), or several
non-template functions taking various types of views (status quo). No
combinatorial explosion of code per possible pointer type.

-- 
Matthew
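The forwarding idea can be sketched in a few lines. Everything here is an assumption made for illustration: BasicView is a hypothetical minimal (pointer, size) pair and countChar is a made-up operation; neither is Qt API. The point is only that the (pointer, size) overload carries no logic of its own.

```cpp
#include <cstddef>

// Hypothetical minimal view: a (pointer, size) pair templated on the
// code unit type.
template <typename Char>
struct BasicView {
    const Char *data;
    std::size_t size;
};

// The real implementation, written once against the view.
template <typename Char>
std::size_t countChar(BasicView<Char> s, Char c)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size; ++i)
        if (s.data[i] == c)
            ++n;
    return n;
}

// The (pointer, size) variant is a trivial wrapper: it builds the view
// and forwards, for whatever code unit type the pointer has.
template <typename Char>
std::size_t countChar(const Char *s, std::size_t len, Char c)
{
    return countChar(BasicView<Char>{s, len}, c);
}
```

For example, `countChar("banana", 6, 'a')` and `countChar(u"banana", 6, u'n')` both resolve to the wrapper and then to the view-based definition. This says nothing about how the view-based function itself handles different encodings, which is the part the thread disagrees on.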
Re: [Development] char8_t summary?
On Thursday, 11 July 2019 13:41:49 -03 Matthew Woehlke wrote:
> On 11/07/2019 05.05, Mutz, Marc via Development wrote:
> > There is a cost associated with another string class, too, and it's
> > combinatorial explosion. Even when we have all view types
> > (QLatin1StringView, QUtf8StringView, QStringView), consider the overload
> > set of QString::replace(), ignoring the (ptr, size) variants:
> >
> >    {QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}
> >
> > that's 16 overloads. And that's without a possible QUtf32StringView.
>
> So?
>
> The right way to handle this is for those methods to be templated, in
> which case a) the code only needs to be written O(1) times, not O(N)
> times, and b) users can potentially specialize for their own string
> types as well.

Except that the whole point of those methods is that they can be more
efficient when the encoding is known and therefore templating won't help.
Templating won't make overload resolution any faster, but will make
compilation times slower. And if we want to make use of the fact that a string
is UTF-8, the templates won't work.

Right now, we know bytelength(latin1string) == codepointlength(utf16string),
so we know how to efficiently replace and we apply that knowledge to indexOf,
startsWith, endsWith, etc. That's not the case for UTF-8, so algorithms will
begin to differ very quickly.

> If done cleverly, even the (pointer, size) variants should be able to
> wrap the arguments in a View, such that those method definitions are
> trivial.

View = (pointer,size) pair.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
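The length relationship can be checked with a small sketch. The function names are hypothetical and the UTF-8 input is assumed to be well-formed; std::string is used as a plain byte container for both encodings. For Latin-1, the UTF-16 length is known from the byte length alone, while for UTF-8 it can only be found by scanning the data.

```cpp
#include <cstddef>
#include <string>

// Latin-1: every byte becomes exactly one UTF-16 code unit, so the
// UTF-16 length is known without looking at the data.
std::size_t utf16LengthOfLatin1(const std::string &latin1)
{
    return latin1.size();
}

// UTF-8: the UTF-16 length depends on the content. Assumes well-formed
// input; no validation is performed.
std::size_t utf16LengthOfUtf8(const std::string &utf8)
{
    std::size_t units = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) == 0x80)
            continue;                  // continuation byte, already counted
        units += (c >= 0xF0) ? 2 : 1;  // 4-byte sequences need a surrogate pair
    }
    return units;
}
```

With this, the 3-byte UTF-8 sequence for U+20AC yields one UTF-16 code unit and the 4-byte sequence for U+1F600 yields two, which is why an early-out such as "the needle is longer than the haystack, so it cannot match" is not valid on raw code unit counts when UTF-8 is involved.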
Re: [Development] char8_t summary?
> please, if it can be avoided, don't add yet another string-related class
> to Qt. Knowing when to properly use QString, QByteArray, QLatin1String,
> QStringLiteral, QStringRef and QStringView (I may have missed a few) is
> already a challenge. And I imagine for people new to Qt it can even be a
> strong deterrent (after all, strings are something you tend to use even
> in a simple Hello World - the first app most people see or write in a new
> language/framework).

I totally agree.

Maybe this helps (I could not find such a document):
https://bugreports.qt.io/browse/QTBUG-77020

-- 
Best Regards,
Bernhard Lindner
Re: [Development] char8_t summary?
On Thu, 11 Jul 2019 at 18:43, Matthew Woehlke wrote:
> On 11/07/2019 05.05, Mutz, Marc via Development wrote:
> > There is a cost associated with another string class, too, and it's
> > combinatorial explosion. Even when we have all view types
> > (QLatin1StringView, QUtf8StringView, QStringView), consider the overload
> > set of QString::replace(), ignoring the (ptr, size) variants:
> >
> >    {QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}
> >
> > that's 16 overloads. And that's without a possible QUtf32StringView.
>
> So?

I have nothing to say in this discussion, but just want to throw in one
small hint/request/worry: please, if it can be avoided, don't add yet
another string-related class to Qt. Knowing when to properly use QString,
QByteArray, QLatin1String, QStringLiteral, QStringRef and QStringView (I
may have missed a few) is already a challenge. And I imagine for people new
to Qt it can even be a strong deterrent (after all, strings are something
you tend to use even in a simple Hello World - the first app most people
see or write in a new language/framework).

> The right way to handle this is for those methods to be templated, in
> which case a) the code only needs to be written O(1) times, not O(N)
> times, and b) users can potentially specialize for their own string
> types as well.
>
> If done cleverly, even the (pointer, size) variants should be able to
> wrap the arguments in a View, such that those method definitions are
> trivial.
>
> --
> Matthew
Re: [Development] char8_t summary?
On 11/07/2019 05.05, Mutz, Marc via Development wrote:
> There is a cost associated with another string class, too, and it's
> combinatorial explosion. Even when we have all view types
> (QLatin1StringView, QUtf8StringView, QStringView), consider the overload
> set of QString::replace(), ignoring the (ptr, size) variants:
>
>    {QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}
>
> that's 16 overloads. And that's without a possible QUtf32StringView.

So?

The right way to handle this is for those methods to be templated, in which
case a) the code only needs to be written O(1) times, not O(N) times, and b)
users can potentially specialize for their own string types as well.

If done cleverly, even the (pointer, size) variants should be able to wrap the
arguments in a View, such that those method definitions are trivial.

--
Matthew
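A minimal sketch of the templated approach described above, for illustration only. It assumes fixed-width code units (Latin-1 and UTF-16), uses std::string_view and std::u16string_view as stand-ins for the Qt view types, and the names toUtf16Unit and contains are hypothetical. UTF-8 is left out on purpose, since other messages in this thread argue that it shares too little code with the fixed-width cases.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Stand-ins for Qt's view types (assumption: chosen so the sketch builds
// without Qt; they are not QLatin1StringView / QStringView).
using Latin1View = std::string_view;     // one byte per character
using Utf16View  = std::u16string_view;  // one char16_t per code unit

// Overloaded helper: the only place where the two encodings differ.
// Latin-1 is the first 256 code points of Unicode, so widening is a plain
// zero-extension.
inline char16_t toUtf16Unit(char c) noexcept { return static_cast<unsigned char>(c); }
inline char16_t toUtf16Unit(char16_t c) noexcept { return c; }

// Written once, instantiated for every {haystack, needle} combination.
template <typename Haystack, typename Needle>
bool contains(Haystack haystack, Needle needle) noexcept
{
    if (needle.size() > haystack.size())
        return false;
    const std::size_t last = haystack.size() - needle.size();
    for (std::size_t i = 0; i <= last; ++i) {
        std::size_t j = 0;
        while (j < needle.size()
               && toUtf16Unit(haystack[i + j]) == toUtf16Unit(needle[j]))
            ++j;
        if (j == needle.size())
            return true;
    }
    return false;
}

// The (pointer, size) variant only wraps its arguments in a view.
template <typename Haystack>
bool contains(Haystack haystack, const char16_t *needle, std::size_t size) noexcept
{
    assert(needle != nullptr || size == 0);
    return contains(haystack, Utf16View(needle, size));
}
```

A call such as contains(Utf16View(u"char8_t summary"), Latin1View("summary")) picks the two-argument template, and the three-argument form forwards to it. The naive search is only there to keep the example short; it says nothing about how Qt implements its searches.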
Re: [Development] char8_t summary?
On 2019-07-11 10:13, André Pönitz wrote:
> On Wed, Jul 10, 2019 at 10:01:04PM -0300, Thiago Macieira wrote:
> > On Wednesday, 10 July 2019 09:55:02 -03 André Pönitz wrote:
> > > As far as I understand there's a perceived need to have "full" utf8
> > > literals, and there's a need to have ASCII literals. First could be
> > > served by some QUtf8*, second by QAscii*, both additions, no need to
> > > change QLatin* semantics.
> >
> > ASCII = Latin1
>
> bool = char ?
> circle = ellipse ?
>
> It's a subset, it is special enough to be called by its name.
>
> Especially if it has features (e.g. toUpper/toLower operating on single
> letters) that are not present in the larger set.
>
> The line of discussion here is
> - people (correctly, happily) use toUpper on (7-bit clean US-ASCII) data
> - ASCII is claimed to be identical to Latin1
> - since it is identical it is superfluous to have both and ASCII is dropped
> - toUpper does not work per-char for Latin1 in corner cases
> - so it needs to be dropped "to avoid wrong use"

There is a cost associated with another string class, too, and it's
combinatorial explosion. Even when we have all view types (QLatin1StringView,
QUtf8StringView, QStringView), consider the overload set of
QString::replace(), ignoring the (ptr, size) variants:

   {QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}

that's 16 overloads. And that's without a possible QUtf32StringView. Ditto for
the relational operators. Add QAsciiStringView and you're up to 25.

Mind you, this is the math for the end game: no more const char*, const
char8_t*, and (ptr, size) overloads as they've all been subsumed by their
corresponding views. We'll be there, maybe, come Qt 7. The math is even worse
until then.

> In the end this deprives users of a useful tool in a scenario where it
> was perfectly fine to use.

I don't see how. Users will be able to use QU8V or QL1V's toUpper() and
they'll just work for US-ASCII. The L1 algorithm can be coded such that only ß
and \xFF are on a slow path.

Or maybe it's the case that toUpper() doesn't extend the length of
UTF-8-encoded text? Maybe we're lucky and Unicode finally gets that the
capital letter ß isn't SS, but ẞ, and we can then just document that if the
capital letter isn't representable in L1, then it stays unchanged.

I'm still not convinced that QAsciiString is needed for any of this.

Thanks,
Marc
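The last idea above (leave a character unchanged when its capital is not a single L1 character) could look roughly like the sketch below. This is an assumed policy taken from the message, not existing Qt behaviour, and the function names latin1ToUpper and latin1ToUpperInPlace are hypothetical. Strings are held in std::string as raw Latin-1 bytes.

```cpp
#include <cassert>
#include <string>

// Upper-case one Latin-1 byte without ever changing the length of the text.
// Assumed policy: 0xDF (ß) and 0xFF (ÿ) have no single-character capital in
// Latin-1, so they are returned unchanged, like every non-letter.
constexpr unsigned char latin1ToUpper(unsigned char c) noexcept
{
    const bool asciiLower = c >= 'a' && c <= 'z';
    // 0xE0..0xFE are the accented lower-case letters, except 0xF7 (÷).
    const bool accentedLower = c >= 0xE0 && c <= 0xFE && c != 0xF7;
    return (asciiLower || accentedLower) ? static_cast<unsigned char>(c - 0x20) : c;
}

// Because no byte ever expands, the conversion can run in place.
inline void latin1ToUpperInPlace(std::string &latin1) noexcept
{
    for (char &ch : latin1)
        ch = static_cast<char>(latin1ToUpper(static_cast<unsigned char>(ch)));
}
```

For pure US-ASCII input only the first range test ever matches, which is the case the quoted objection cares about.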
Re: [Development] char8_t summary?
On Wed, Jul 10, 2019 at 10:01:04PM -0300, Thiago Macieira wrote:
> On Wednesday, 10 July 2019 09:55:02 -03 André Pönitz wrote:
> > As far as I understand there's a perceived need to have "full" utf8
> > literals, and there's a need to have ASCII literals. First could be
> > served by some QUtf8*, second by QAscii*, both additions, no need to
> > change QLatin* semantics.
>
> ASCII = Latin1

bool = char ?
circle = ellipse ?

It's a subset, it is special enough to be called by its name.

Especially if it has features (e.g. toUpper/toLower operating on single
letters) that are not present in the larger set.

The line of discussion here is
- people (correctly, happily) use toUpper on (7-bit clean US-ASCII) data
- ASCII is claimed to be identical to Latin1
- since it is identical it is superfluous to have both and ASCII is dropped
- toUpper does not work per-char for Latin1 in corner cases
- so it needs to be dropped "to avoid wrong use"

In the end this deprives users of a useful tool in a scenario where it was
perfectly fine to use.

Andre'
Re: [Development] char8_t summary?
On Wednesday, 10 July 2019 22:01:04 -03 Thiago Macieira wrote:
> On Wednesday, 10 July 2019 09:55:02 -03 André Pönitz wrote:
> > As far as I understand there's a perceived need to have "full" utf8
> > literals, and there's a need to have ASCII literals. First could be
> > served by some QUtf8*, second by QAscii*, both additions, no need to
> > change QLatin* semantics.
>
> ASCII = Latin1

In the sense that the class holding ASCII should be the Latin1 class, for the
reasons that Marc presented. It's actually faster to convert from Latin1 to
UTF-16 than from US-ASCII to UTF-16 (unless we declare out-of-bounds US-ASCII
UB).

The only issue is what to do with the transforming functions toUpper and
toLower.

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
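To illustrate the speed remark above: a rough sketch of the two conversions, assuming scalar loops and std::string_view / std::u16string as stand-ins for the Qt types. The function names and the std::nullopt error policy are hypothetical. Real implementations would be vectorised, but the extra range check on the ASCII side stays unless out-of-range input is declared UB.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Latin-1 -> UTF-16: every byte is a valid character, so the loop is a pure
// zero-extension with no branch on the data.
inline std::u16string latin1ToUtf16(std::string_view latin1)
{
    std::u16string out;
    out.reserve(latin1.size());
    for (char ch : latin1)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(ch)));
    return out;
}

// US-ASCII -> UTF-16: the same widening plus a range check on every byte.
// Returning std::nullopt on bytes above 0x7F is an assumed error policy.
inline std::optional<std::u16string> asciiToUtf16(std::string_view ascii)
{
    std::u16string out;
    out.reserve(ascii.size());
    for (char ch : ascii) {
        const auto byte = static_cast<unsigned char>(ch);
        if (byte > 0x7F)
            return std::nullopt;
        out.push_back(static_cast<char16_t>(byte));
    }
    return out;
}
```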
Re: [Development] char8_t summary?
On Wednesday, 10 July 2019 09:55:02 -03 André Pönitz wrote:
> As far as I understand there's a perceived need to have "full" utf8
> literals, and there's a need to have ASCII literals. First could be
> served by some QUtf8*, second by QAscii*, both additions, no need to
> change QLatin* semantics.

ASCII = Latin1

--
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products
Re: [Development] char8_t summary?
On 10/07/2019 09.10, Mutz, Marc via Development wrote:
> The other reason is about error checking: What should the result be of
> putting an æ into a QAsciiString? Assert at runtime? UB? In
> QLatin1String, this error just can't happen. Even if you feed it UTF-8,
> you may get mojibake, because you picked the wrong encoding, but it's
> not an error. Any UTF-8 octet sequence is a valid L1 string.
>
> So, I don't see QAscii* pulling its weight.

The reason ASCII might be helpful is that it guarantees that certain
transformations (e.g. case conversion) can be done in place. L1 can't do this;
the L1 upper-case of U+00DF ('ß') is "SS". U+00FF ('ÿ') is in a similar boat;
I'm not sure it *has* an L1 upper-case. (The "proper" upper-case is, I
presume, U+0178, which is not in L1.)

Also, conversion from ASCII to either L1 or UTF-8 is a no-op. (ASCII to UTF-16
can also be done with strict widening, but that's true for L1 also.)

--
Matthew
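Both ASCII properties mentioned above can be sketched in a few lines. This is illustrative only: the names asciiToUpperInPlace and latin1ToUtf8 are hypothetical, std::string holds raw bytes, and the assert is just one of the error policies discussed in this thread for non-ASCII input.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// In-place upper-casing for 7-bit US-ASCII: one byte in, one byte out.
// The assert is an assumed answer to "what happens on non-ASCII input".
inline void asciiToUpperInPlace(std::string &ascii)
{
    for (char &ch : ascii) {
        assert(static_cast<unsigned char>(ch) <= 0x7F);
        if (ch >= 'a' && ch <= 'z')
            ch = static_cast<char>(ch - ('a' - 'A'));
    }
}

// Latin-1 -> UTF-8. Bytes below 0x80 are copied as they are, which is why
// pure ASCII data needs no conversion; every other byte becomes two bytes.
inline std::string latin1ToUtf8(std::string_view latin1)
{
    std::string out;
    out.reserve(latin1.size());
    for (char ch : latin1) {
        const auto byte = static_cast<unsigned char>(ch);
        if (byte < 0x80) {
            out.push_back(ch);
        } else {
            out.push_back(static_cast<char>(0xC0 | (byte >> 6)));
            out.push_back(static_cast<char>(0x80 | (byte & 0x3F)));
        }
    }
    return out;
}
```

For ASCII-only input latin1ToUtf8 returns a byte-for-byte copy, so a caller that knows its data is ASCII could skip the call entirely.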
Re: [Development] char8_t summary?
On 2019-07-10 14:55, André Pönitz wrote:
> On Wed, Jul 10, 2019 at 11:29:15AM +0200, Mutz, Marc via Development wrote:
> > On 2019-07-10 10:50, Arnaud Clere wrote:
> > > Hi all,
> > >
> > > So, do I understand correctly that:
> > > 1. QUtf8String may be required in Qt7 to solve problems due to C++2x
> > > char8_t
> >
> > I wouldn't say required. I also don't think it needs to wait until Qt 7.
> > Qt 7 is where we may depend on C++20 and can use char8_t in the interface
> > and implementation, but we should certainly not wait for that to add the
> > class. It's certainly a good idea, IMO, to have views and owning
> > containers that operate on L1, UTF-8 and UTF-16 strings. The views are
> > more important.
> >
> > > 2. QByteArray methods currently operating on latin1 may be restricted
> > > to ascii in Qt6 to avoid problems when const char* input really is
> > > utf8
> >
> > I have no opinion on that.
> >
> > > 3. QLatin1String may become QLatin1StringView by Qt7
> >
> > Qt 6. We can add the name as an alias now, make QLatin1String an owning
> > container for Qt 6.0 (it breaks no code, just makes it slower, and the
> > port is trivial), and QLatin1StringView becomes what QLatin1String is now.
>
> As far as I understand there's a perceived need to have "full" utf8
> literals, and there's a need to have ASCII literals. First could be
> served by some QUtf8*, second by QAscii*, both additions, no need to
> change QLatin* semantics.

L1 is special because it covers the first 256 code points of Unicode, so
conversion between the two will always be faster than between other encodings.
This is why it makes sense to use all 8 bits and have L1, not artificially
restrict to US-ASCII strings. That's one reason: opportunism.

The other reason is about error checking: What should the result be of putting
an æ into a QAsciiString? Assert at runtime? UB? In QLatin1String, this error
just can't happen. Even if you feed it UTF-8, you may get mojibake, because
you picked the wrong encoding, but it's not an error. Any UTF-8 octet sequence
is a valid L1 string.

So, I don't see QAscii* pulling its weight.

Thanks,
Marc
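The error-checking argument above can be made concrete with two toy view classes. Both are hypothetical and are not Qt API: Latin1View accepts any bytes, while AsciiView has to pick a policy for bytes above 0x7F. The std::nullopt result used here is only one assumed choice next to asserting or declaring UB.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical view over Latin-1 data. Construction is total: every byte
// value 0x00..0xFF is a Latin-1 character, so there is nothing to validate.
class Latin1View
{
public:
    explicit Latin1View(std::string_view bytes) noexcept : m_bytes(bytes) {}
    std::size_t size() const noexcept { return m_bytes.size(); }

private:
    std::string_view m_bytes;
};

// Hypothetical view over US-ASCII data. Construction is partial: bytes above
// 0x7F must be rejected somehow, here by returning std::nullopt.
class AsciiView
{
public:
    static std::optional<AsciiView> fromBytes(std::string_view bytes) noexcept
    {
        for (char ch : bytes) {
            if (static_cast<unsigned char>(ch) > 0x7F)
                return std::nullopt;
        }
        return AsciiView(bytes);
    }
    std::size_t size() const noexcept { return m_bytes.size(); }

private:
    explicit AsciiView(std::string_view bytes) noexcept : m_bytes(bytes) {}
    std::string_view m_bytes;
};
```

Feeding the UTF-8 bytes of æ (0xC3 0xA6) to Latin1View gives a two-character view, which is the mojibake case: wrong, but not an error. AsciiView::fromBytes has to refuse the same input.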
Re: [Development] char8_t summary?
On Wed, Jul 10, 2019 at 11:29:15AM +0200, Mutz, Marc via Development wrote:
> On 2019-07-10 10:50, Arnaud Clere wrote:
> > Hi all,
> >
> > So, do I understand correctly that:
> > 1. QUtf8String may be required in Qt7 to solve problems due to C++2x
> > char8_t
>
> I wouldn't say required. I also don't think it needs to wait until Qt 7. Qt
> 7 is where we may depend on C++20 and can use char8_t in the interface and
> implementation, but we should certainly not wait for that to add the class.
> It's certainly a good idea, IMO, to have views and owning containers that
> operate on L1, UTF-8 and UTF-16 strings. The views are more important.
>
> > 2. QByteArray methods currently operating on latin1 may be restricted
> > to ascii in Qt6 to avoid problems when const char* input really is
> > utf8
>
> I have no opinion on that.
>
> > 3. QLatin1String may become QLatin1StringView by Qt7
>
> Qt 6. We can add the name as an alias now, make QLatin1String an owning
> container for Qt 6.0 (it breaks no code, just makes it slower, and the port
> is trivial), and QLatin1StringView becomes what QLatin1String is now.

As far as I understand there's a perceived need to have "full" utf8
literals, and there's a need to have ASCII literals. First could be
served by some QUtf8*, second by QAscii*, both additions, no need to
change QLatin* semantics.

Andre'
Re: [Development] char8_t summary?
On 2019-07-10 10:50, Arnaud Clere wrote:
> Hi all,
>
> So, do I understand correctly that:
> 1. QUtf8String may be required in Qt7 to solve problems due to C++2x
> char8_t

I wouldn't say required. I also don't think it needs to wait until Qt 7. Qt 7
is where we may depend on C++20 and can use char8_t in the interface and
implementation, but we should certainly not wait for that to add the class.
It's certainly a good idea, IMO, to have views and owning containers that
operate on L1, UTF-8 and UTF-16 strings. The views are more important.

> 2. QByteArray methods currently operating on latin1 may be restricted
> to ascii in Qt6 to avoid problems when const char* input really is
> utf8

I have no opinion on that.

> 3. QLatin1String may become QLatin1StringView by Qt7

Qt 6. We can add the name as an alias now, make QLatin1String an owning
container for Qt 6.0 (it breaks no code, just makes it slower, and the port is
trivial), and QLatin1StringView becomes what QLatin1String is now.

> 4. These classes will be independent except maybe for a common internal
> class

Yes. Or separate instantiations of the same class template. They also should
convert to QByteArray. Just not by public inheritance.

Thanks,
Marc
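One way to read the answer to point 4 is sketched below, under stated assumptions: BasicByteString, Encoding and ByteArray are hypothetical names, std::vector<char> stands in for QByteArray, and nothing here reflects an actual Qt design. The point is only that two instantiations of one template are unrelated types, and that conversion to a byte array is an explicit member function rather than an is-a relationship.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

enum class Encoding { Latin1, Utf8 };

using ByteArray = std::vector<char>;  // stand-in for QByteArray (assumption)

// One class template, one instantiation per encoding.
template <Encoding E>
class BasicByteString
{
public:
    explicit BasicByteString(std::string bytes) : m_bytes(std::move(bytes)) {}

    std::size_t size() const noexcept { return m_bytes.size(); }

    // Explicit conversion instead of public inheritance from ByteArray.
    ByteArray toByteArray() const { return ByteArray(m_bytes.begin(), m_bytes.end()); }

private:
    std::string m_bytes;
};

using Latin1String = BasicByteString<Encoding::Latin1>;
using Utf8String   = BasicByteString<Encoding::Utf8>;

// The instantiations are independent: neither converts to the other, and
// neither can be passed where a ByteArray reference is expected.
static_assert(!std::is_same_v<Latin1String, Utf8String>);
static_assert(!std::is_convertible_v<Latin1String, Utf8String>);
static_assert(!std::is_base_of_v<ByteArray, Utf8String>);
```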
Re: [Development] char8_t summary?
Hi all,

So, do I understand correctly that:
1. QUtf8String may be required in Qt7 to solve problems due to C++2x char8_t
2. QByteArray methods currently operating on latin1 may be restricted to
   ascii in Qt6 to avoid problems when const char* input really is utf8
3. QLatin1String may become QLatin1StringView by Qt7
4. These classes will be independent except maybe for a common internal class

Arnaud