Re: [Development] char8_t summary?

2019-07-16 Thread Thiago Macieira
On Tuesday, 16 July 2019 09:11:37 PDT Matthew Woehlke wrote:
> On 15/07/2019 18.19, Thiago Macieira wrote:
> > On Monday, 15 July 2019 09:41:24 PDT Matthew Woehlke wrote:
> >> Note also that I suggested having the template definition out-of-line;
> >> it doesn't need to be in (e.g.) qstring.h or anywhere that will affect
> >> *user* compile times. Only the TU responsible for instantiating them
> >> would be affected, and that should be negligible in the grand scheme of
> >> things.
> > 
> > Then it's no different than an overload, if the implementation isn't the
> > same (and it isn't).
> 
> ...but a template allows the common portions to be written in a single
> definition with overloads *and/or* `if constexpr` used where the code
> needs to differ. Regular overloads would require 100% of the definition
> to be duplicated for each overload.

And what Marc and I are arguing is that the common portions are small enough 
not to be worth the hassle of a template in the first place.

> Concrete example:
> 
>   // .h
>   bool contains(QStringView);
>   bool contains(QLatin1StringView);
[cut]

Two things:
1) templatisation of contains, indexOf, startsWith, etc. is already being done 
in dev

2) the work being done *and* your example are UTF-16 and Latin1 only. The 
whole issue here is that *UTF-8* will not share enough code.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products



___
Development mailing list
Development@qt-project.org
https://lists.qt-project.org/listinfo/development


Re: [Development] char8_t summary?

2019-07-16 Thread Mutz, Marc via Development

On 2019-07-16 18:11, Matthew Woehlke wrote:
> [...]
> The basic
> algorithm (iterate through 'haystack' looking for 'needle') is common
> regardless of the string types. The points that differ (e.g. only
> starting the search at code points, computing lengths) use overloaded
> helper functions which can be inline (e.g. q_next_codepoint for some
> types will just be operator++) and optimized.


Please square me that with this comment from qstring.cpp:


// we're going to read a[0..15] and b[0..15] (32 bytes)


Thanks,
Marc


Re: [Development] char8_t summary?

2019-07-16 Thread Matthew Woehlke
On 15/07/2019 18.19, Thiago Macieira wrote:
> On Monday, 15 July 2019 09:41:24 PDT Matthew Woehlke wrote:
>> Note also that I suggested having the template definition out-of-line;
>> it doesn't need to be in (e.g.) qstring.h or anywhere that will affect
>> *user* compile times. Only the TU responsible for instantiating them
>> would be affected, and that should be negligible in the grand scheme of
>> things.
> 
> Then it's no different than an overload, if the implementation isn't the same 
> (and it isn't).

...but a template allows the common portions to be written in a single
definition with overloads *and/or* `if constexpr` used where the code
needs to differ. Regular overloads would require 100% of the definition
to be duplicated for each overload.

In terms of *declarations*, yes, you are going to have the same number
of declarations. However, those are only one line, and potentially can
be generated for each string type using a macro, so O(M+N) (M = methods,
N = string types) rather than O(M*N) actual source lines. (Granted, you
could do this for plain overload declarations also, but a) this probably
doesn't play as well with documentation, and b) you still have to write
O(M*N) definitions rather than O(M).)

Concrete example:

  // .h
  bool contains(QStringView);
  bool contains(QLatin1StringView);

  // .cpp
  bool contains(QStringView needle)
  {
    ...
  }

  bool contains(QLatin1StringView needle)
  {
    ...
  }

- vs -

  // .h
  template <typename T>
  bool contains(T);
  extern template bool contains(QStringView);
  extern template bool contains(QLatin1StringView);

  // .cpp
  template <typename T>
  bool contains(T needle)
  {
    int const l = needle.chars();
    int i = 0;
    ... // computation of went_too_far elided
    while (i < went_too_far)
    {
      if (q_compare_strings(this->midRef(i), needle, l))
        return true;
      i = q_next_codepoint(this, i);
    }
    return false;
  }

  template bool contains(QStringView);
  template bool contains(QLatin1StringView);

Keep in mind also that this method lives in a notional (templated?)
QGenericString base class and/or is actually a helper function, i.e. it
is also templated on the type of this/haystack... thus I have this *one
and only one* definition of 'contains', rather than O(N²) definitions.

Hopefully this presents a plausible example of common code. The basic
algorithm (iterate through 'haystack' looking for 'needle') is common
regardless of the string types. The points that differ (e.g. only
starting the search at code points, computing lengths) use overloaded
helper functions which can be inline (e.g. q_next_codepoint for some
types will just be operator++) and optimized. It's also likely that
these helpers will be used in multiple methods.
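
A minimal, compilable sketch of the shape described above. It is not Qt code: std::string_view and std::u16string_view stand in for the Latin-1 and UTF-16 views, and unitAt() is a hypothetical helper playing the role of the overloaded per-type functions. It only covers encodings with one code unit per character, which is the case where a single shared loop is straightforward.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Stand-ins for the Qt views discussed above (assumption: Latin-1 data in
// std::string_view, UTF-16 data in std::u16string_view).
using Latin1View = std::string_view;
using Utf16View  = std::u16string_view;

// Overloaded helper (hypothetical name): widen one code unit so that units
// of different types can be compared. Surrogate pairs are not handled.
inline char32_t unitAt(Latin1View s, std::size_t i) { return static_cast<unsigned char>(s[i]); }
inline char32_t unitAt(Utf16View s, std::size_t i)  { return s[i]; }

// One definition of the search loop, templated on both string types.
template <typename Haystack, typename Needle>
bool contains(Haystack haystack, Needle needle)
{
    // This shortcut is valid only because both encodings use one unit
    // per character.
    if (needle.size() > haystack.size())
        return false;
    const std::size_t last = haystack.size() - needle.size();
    for (std::size_t i = 0; i <= last; ++i) {
        std::size_t j = 0;
        while (j < needle.size() && unitAt(haystack, i + j) == unitAt(needle, j))
            ++j;
        if (j == needle.size())
            return true;
    }
    return false;
}

// Explicit instantiations, as they would appear in the single .cpp file.
template bool contains(Utf16View, Utf16View);
template bool contains(Utf16View, Latin1View);
```

With these definitions, contains(Utf16View(u"hello world"), Latin1View("lo w")) is true, and a UTF-8 needle would need a different helper and a different length check.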

-- 
Matthew


Re: [Development] char8_t summary?

2019-07-15 Thread Thiago Macieira
On Monday, 15 July 2019 09:41:24 PDT Matthew Woehlke wrote:
> Note also that I suggested having the template definition out-of-line;
> it doesn't need to be in (e.g.) qstring.h or anywhere that will affect
> *user* compile times. Only the TU responsible for instantiating them
> would be affected, and that should be negligible in the grand scheme of
> things.

Then it's no different than an overload, if the implementation isn't the same 
(and it isn't).

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] char8_t summary?

2019-07-15 Thread Matthew Woehlke
On 14/07/2019 02.28, Mutz, Marc via Development wrote:
> If you're still not convinced, here's QStringView::endsWith() as a
> template:
> 
>    template <typename Prefix>
>    requires std::is_convertible_v<Prefix, QUtf8StringView> || ... QLatin1StringView ...
>    Q_ALWAYS_INLINE
>    bool endsWith(Prefix &&p) const {
>        return QtPrivate::endsWith(*this,
>                                   QtPrivate::qStringLikeToStringView(p));
>    }
> 
> with a qStringLikeToStringView() similar to the one in 181620. This uses
> C++20, and I'm sure it loses something over the current implementation.
> Qt::CaseSensitivity comes to mind.

...and I don't know why you didn't just propagate through the case
sensitivity argument?

> To anyone speaking up in favour of
> the box: Please write this in C++11 before you hit reply :)

IIUC, replacing the `requires` is trivial. A bit ugly, sure, but not
difficult.
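
For illustration, a sketch of how such a constraint can be spelled in C++11 with std::enable_if. The View and Haystack types are hypothetical stand-ins, not Qt classes, and the comparison is a plain byte compare with no case-sensitivity argument.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a string view type; not a Qt class.
struct View {
    const char *data;
    std::size_t size;
    View(const char *s) : data(s), size(std::char_traits<char>::length(s)) {}
    View(const std::string &s) : data(s.data()), size(s.size()) {}
};

struct Haystack {
    std::string text;

    // C++11 constraint: this overload takes part in overload resolution
    // only if Suffix converts implicitly to View, which is the same effect
    // a C++20 `requires std::is_convertible_v<Suffix, View>` would have.
    template <typename Suffix,
              typename std::enable_if<std::is_convertible<Suffix, View>::value, int>::type = 0>
    bool endsWith(const Suffix &suffix) const
    {
        const View v = suffix;   // the conversion the constraint promised
        return v.size <= text.size()
            && text.compare(text.size() - v.size, v.size, v.data, v.size) == 0;
    }
};
```

Calling endsWith() with an argument that does not convert to View, such as an int, is rejected at compile time because no overload remains viable.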

I also question the value of the indirection in the above. Moving the
implementation of QtPrivate::endsWith to be inline, and making use of
`if constexpr` where useful, will hopefully reduce the total amount of
code. (Yes, eventually you're going to have an optimized string
comparison. Helper code like that to implement the critical code paths
will still exist, but hopefully those are bits that get used over and
over in many methods.)

Note also that I suggested having the template definition out-of-line;
it doesn't need to be in (e.g.) qstring.h or anywhere that will affect
*user* compile times. Only the TU responsible for instantiating them
would be affected, and that should be negligible in the grand scheme of
things.

BTW, I don't think ternary functions are an issue. The ones that come to
mind will "always"¹ need to convert one of their arguments anyway, so
while the *templates* may involve another level of combinatorics, that
level won't affect the implementation complexity in any meaningful way.

(¹ Possibly they can skip this because that argument is never actually
used, but otherwise it must be converted.)

-- 
Matthew


Re: [Development] char8_t summary?

2019-07-15 Thread Giuseppe D'Angelo via Development

On 13/07/2019 21:39, Volker Hilsheimer wrote:
> With an (ideally) single template-based API we don’t have people using Qt
> get lost in the jungle of overloads and string classes. For the
> implementation, we can specialise the templates to call the suitable
> internal functions that implement the various algorithms.


This is basically a Qt 7 idea, raised some time ago: a string class that 
is a collection of code points under "some" Unicode encoding, 
transparently wrapping UTF-8 / 16 / 32 sequences without extra copies. 
Functions between these strings dispatch to the right overload, in a 
manner that is totally invisible for the user. Similarly, the high level 
API works in terms of code points, not units.


Only if one wants to get the hands dirty then one can query and extract 
the actual encoded data.


My 2 c,
--
Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer
KDAB (France) S.A.S., a KDAB Group company
Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com
KDAB - The Qt, C++ and OpenGL Experts





Re: [Development] char8_t summary?

2019-07-15 Thread Matthew Woehlke
On 13/07/2019 15.39, Volker Hilsheimer wrote:
> As I understood the template suggestion, it’s more about not having 
> to add 64 different overloads (or several more string classes) to
> the Qt API, and less about unifying all implementations into a single
> set of algorithms.

Right. At some point you are going to call out to specialized functions
(e.g. qt_compare_strings as Marc mentioned). The thought was to have a
(more modest) set of these specialized helpers with the generic bits
implemented as template logic. Probably with a bunch of `if constexpr`
branches to perform optimizations when possible.

> On 13/07/2019 07.41, Thiago Macieira wrote:
>> Again, note how the template implicitly assumes things. A 3-character
>> string cannot be present at the beginning (startsWith), end (endsWith)
>> or anywhere in the middle (contains, indexOf, lastIndexOf) of a
>> 2-character one, for example.
>>
>> But a 2- and 3-byte UTF-8 string can be the prefix of a 1-character
>> UTF-16 string and a 4-byte UTF-8 string can be the prefix of a
>> 2-codeunit UTF-16 (1 character).

The correct fix for that is to count code points, not characters.
Possibly this means that such optimization should be behind an 'if
constexpr' to only use it when it is safe to do so.
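
A rough sketch of that idea, under stated assumptions: std::string_view holds valid UTF-8, std::u16string_view holds valid UTF-16, and decodeNext() is a hypothetical helper that does no validation. The unit-count shortcut is taken only when both sides have the same type. Otherwise the two strings are decoded in lockstep and compared by code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <type_traits>

using Utf8View  = std::string_view;     // assumption: bytes are valid UTF-8
using Utf16View = std::u16string_view;  // assumption: valid UTF-16

// Decode the code point starting at s[i] and advance i past it.
inline char32_t decodeNext(Utf8View s, std::size_t &i)
{
    const unsigned char b = s[i++];
    if (b < 0x80)
        return b;
    const int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;  // continuation bytes
    char32_t cp = b & (0x3F >> extra);
    for (int k = 0; k < extra; ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

inline char32_t decodeNext(Utf16View s, std::size_t &i)
{
    const char32_t u = s[i++];
    if (u < 0xD800 || u > 0xDBFF)
        return u;                        // not a high surrogate
    const char32_t lo = s[i++];
    return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
}

template <typename Haystack, typename Prefix>
bool startsWith(Haystack h, Prefix p)
{
    // Comparing unit counts is only meaningful within one encoding.
    if constexpr (std::is_same_v<Haystack, Prefix>) {
        if (p.size() > h.size())
            return false;
    }
    std::size_t i = 0, j = 0;
    while (j < p.size()) {
        if (i >= h.size() || decodeNext(h, i) != decodeNext(p, j))
            return false;
    }
    return true;
}
```

Here a 3-byte UTF-8 prefix matches a 1-unit UTF-16 string, which a plain size comparison would have rejected. The price is a decode loop instead of the vectorised comparisons available when both sides share an encoding.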

-- 
Matthew


Re: [Development] char8_t summary?

2019-07-14 Thread André Pönitz
On Sun, Jul 14, 2019 at 08:28:58AM +0200, Mutz, Marc via Development wrote:
> > As I understood the template suggestion, it’s more about not having to
> > add 64 different overloads (or several more string classes) to the Qt
> > API, and less about unifying all implementations into a single set of
> > algorithms.
> 
> [I'm replying to Volker, but this should be read as replying to everyone,
> and 'you' should be read as the plural form]

Thanks for this clarification. It really helps.
 
> [...]
> But that doesn't reduce the number of overloads.

Has a thin wrapper around the "usual suspects" of string-like arguments
been considered as the normal way of passing arguments?

This could be something like a string view with a few bits spent on
encoding information.
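
A hypothetical sketch of such a wrapper, using standard library types in place of the Qt string classes. The class name AnyStringView, its Encoding tag and the encodingName() function are all made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical type; nothing like it is claimed to exist in Qt.
class AnyStringView
{
public:
    enum class Encoding : unsigned char { Latin1, Utf8, Utf16 };

    // Implicit constructors: this is where the caller's one "free"
    // user-defined conversion is spent.
    AnyStringView(std::u16string_view s)
        : m_data(s.data()), m_size(s.size()), m_encoding(Encoding::Utf16) {}
    AnyStringView(const std::u16string &s)
        : AnyStringView(std::u16string_view(s)) {}
    AnyStringView(std::string_view s, Encoding e = Encoding::Utf8)
        : m_data(s.data()), m_size(s.size()), m_encoding(e) {}

    const void *data() const { return m_data; }
    std::size_t size() const { return m_size; }          // in code units
    Encoding encoding() const { return m_encoding; }
    std::size_t byteSize() const
    { return m_encoding == Encoding::Utf16 ? m_size * sizeof(char16_t) : m_size; }

private:
    const void *m_data;
    std::size_t m_size;
    Encoding m_encoding;
};

// One non-template signature accepts every supported string type; the
// implementation switches on the tag at run time.
inline const char *encodingName(AnyStringView v)
{
    switch (v.encoding()) {
    case AnyStringView::Encoding::Latin1: return "Latin-1";
    case AnyStringView::Encoding::Utf8:   return "UTF-8";
    case AnyStringView::Encoding::Utf16:  return "UTF-16";
    }
    return "unknown";
}
```

The dispatch moves from overload resolution at compile time to a switch at run time, so the API surface shrinks while the per-encoding implementations behind it stay separate.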

This effectively eats the "free" implicit type conversion when passing an
argument but history (introduction of QStringBuilder) has shown that while
not completely source compatible, it was fairly harmless.

QStringBuilder itself already uses up the free conversion, but could
get an operator xxx() to produce the new argument type, and could even keep
track of the encodings of the parts to help provide the right
encoding bits.


>    template <typename Prefix>
>    requires std::is_convertible_v<Prefix, QUtf8StringView> || ... QLatin1StringView ...
>    Q_ALWAYS_INLINE
>    bool endsWith(Prefix &&p) const {
>        return QtPrivate::endsWith(*this,
>                                   QtPrivate::qStringLikeToStringView(p));
>    }
> 
> with a qStringLikeToStringView() similar to the one in 181620.

This looks kind of related, just that the qStringLikeToStringView() call
should not need to be written out explicitly multiple times on the receiver
side, but be done implicitly in the conversion of the arguments in the
function call.

Andre'


Re: [Development] char8_t summary?

2019-07-14 Thread Mutz, Marc via Development

On 2019-07-13 21:39, Volker Hilsheimer wrote:
>> On 13 Jul 2019, at 13:41, Thiago Macieira wrote:
>> On Friday, 12 July 2019 17:37:59 -03 Matthew Woehlke wrote:
>>> That said, I took a look at startsWith, and... surprise! It is *already
>>> a template*. So at least in that case, it isn't obvious why adding more
>>> combinations would be so terribly onerous.
>>
>> Again, note how the template implicitly assumes things. A 3-character
>> string cannot be present at the beginning (startsWith), end (endsWith)
>> or anywhere in the middle (contains, indexOf, lastIndexOf) of a
>> 2-character one, for example.
>>
>> But a 2- and 3-byte UTF-8 string can be the prefix of a 1-character
>> UTF-16 string and a 4-byte UTF-8 string can be the prefix of a
>> 2-codeunit UTF-16 (1 character). That means implementing UTF-8 functions
>> requires different algorithms in the first place. That means templates
>> are not usually the answer.
>>
>> I'm not saying impossible. You can, by writing sufficiently generic
>> algorithms that scan the strings in lockstep (you can scan UTF-8
>> backwards, after all). But the reason you don't *want* to is that our
>> Latin1 and UTF-16 algorithms are optimised, often vectorised, for their
>> purpose. We don't want to lose the efficiency we've already got.
>>
>> And I'm not saying we shouldn't have UTF-8 algorithms or even a
>> QUtf8StringView or some such. It would have helped in CBOR, for example,
>> see QCborStreamWriter:
>>    void appendTextString(const char *utf8, qsizetype len);
>>
>> This is one that should at least get the overload.
>>
>> --
>> Thiago Macieira - thiago.macieira (AT) intel.com
>>  Software Architect - Intel System Software Products
>
> As I understood the template suggestion, it’s more about not having to
> add 64 different overloads (or several more string classes) to the Qt
> API, and less about unifying all implementations into a single set of
> algorithms.


[I'm replying to Volker, but this should be read as replying to 
everyone, and 'you' should be read as the plural form]


There's a bonus for documentability, of course, by using templates: one 
template vs. 64 explicit overloads. I hasten to add that the 64 is 
counting *this, so we're back to 16 for documentation purposes, because 
no-one is proposing to remove the member functions and only provide the 
free functions that back them, and that it's harder to document what a 
template accepts than it is to document 16 overloads, now that we can 
have multiple \fn per qdoc comment block.


But that doesn't reduce the number of overloads. That template will be 
instantiated 16 times (and more, as it's hard to ignore const/non-const 
without forcing a copy, and even with a copy, the template function 
doesn't do implicit conversions the way an ordinary function would). 
Those instantiations are functions. Inline ones, hopefully, but 
nonetheless functions. It will not help compile-times, and it will 
degrade the error messages from the compiler, even if we (as we should) 
constrain the template.


As an example of what all of this means, look at 
https://codereview.qt-project.org/c/qt/qtbase/+/181620, which is doing 
exactly that: make a former non-template a template function. Not even 
Thiago is sure it won't break code, and while I'd like to stand in front 
of you and claim that I designed it so that there _is_ no difference, in 
practice I wouldn't bet that some obscure compiler (like MSVC or the 
Integrity one) won't throw logs^Wtrunks in my way by the time I hit 
submit. Or look at QStringView ctors. It's a bit harder than it needs to 
be, because QStringView can't depend on QString in-size (because QString 
does on QStringView), but you're basically asking to make every string 
class member function that takes another string a mixture of 
QString::arg() as proposed in 181620 and current QStringView 
construction.


Besides, as we all know, you can't partially-specialise function 
templates, so if you write 'specialise' what you're saying is either 
'overload' or 'add a template struct with static members, partially 
specialise the struct' (iow: overloads).
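
To make the second option concrete, here is a small sketch using standard library views instead of Qt types. StartsWithImpl and widen() are hypothetical names. The function template only forwards, and the partial specialisation for "same type on both sides" lives on the struct.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helpers: widen a code unit so mixed types can be compared.
inline char32_t widen(char c)     { return static_cast<unsigned char>(c); } // Latin-1 assumed
inline char32_t widen(char16_t c) { return c; }

// Primary template: mixed view types, compare unit by unit after widening.
template <typename Haystack, typename Needle>
struct StartsWithImpl {
    static bool run(Haystack h, Needle n)
    {
        if (n.size() > h.size())
            return false;
        for (std::size_t i = 0; i < n.size(); ++i)
            if (widen(h[i]) != widen(n[i]))
                return false;
        return true;
    }
};

// Partial specialisation: same view type on both sides, so the library's
// bulk comparison can be used instead of the generic loop.
template <typename View>
struct StartsWithImpl<View, View> {
    static bool run(View h, View n)
    { return n.size() <= h.size() && h.substr(0, n.size()) == n; }
};

// The function template cannot be partially specialised itself, so it
// delegates to the struct.
template <typename Haystack, typename Needle>
bool startsWith(Haystack h, Needle n)
{ return StartsWithImpl<Haystack, Needle>::run(h, n); }
```

Each specialisation is still a separately written function body, which is the point being made: the count of implementations does not shrink, it only moves.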


I hope this convinces everyone to finally close the lid on the box 
labelled 'use templates and everything will be oh so easy'.


Will we (have to) use templates? Yes. Will it reduce the number of 
overloads? Only if you want to inflict pain on your users.


If you're still not convinced, here's QStringView::endsWith() as a 
template:


   template <typename Prefix>
   requires std::is_convertible_v<Prefix, QUtf8StringView> || ... QLatin1StringView ...
   Q_ALWAYS_INLINE
   bool endsWith(Prefix &&p) const {
       return QtPrivate::endsWith(*this,
                                  QtPrivate::qStringLikeToStringView(p));
   }

with a qStringLikeToStringView() similar to the one in 181620. This uses 
C++20, and I'm sure it loses something over the current implementation. 
Qt::CaseSensitivity comes to mind. To anyone speaking up in favour of 
the box: Please write this in C++11 before you hit reply :)


Thanks,
Marc

Re: [Development] char8_t summary?

2019-07-13 Thread Volker Hilsheimer
> On 13 Jul 2019, at 13:41, Thiago Macieira  wrote:
> On Friday, 12 July 2019 17:37:59 -03 Matthew Woehlke wrote:
>> That said, I took a look at startsWith, and... surprise! It is *already
>> a template*. So at least in that case, it isn't obvious why adding more
>> combinations would be so terribly onerous.
> 
> Again, note how the template implicitly assumes things. A 3-character string 
> cannot be present at the beginning (startsWith), end (endsWith) or anywhere 
> in 
> the middle (contains, indexOf, lastIndexOf) of a 2-character one, for example.
> 
> But a 2- and 3-byte UTF-8 string can be the prefix of a 1-character UTF-16 
> string and a 4-byte UTF-8 string can be the prefix of a 2-codeunit UTF-16 (1 
> character). That means implementing UTF-8 functions requires different 
> algorithms in the first place. That means templates are not usually the 
> answer.
> 
> I'm not saying impossible. You can, by writing sufficiently generic 
> algorithms 
> that scan the strings in lockstep (you can scan UTF-8 backwards, after all). 
> But the reason you don't *want* to is that our Latin1 and UTF-16 algorithms 
> are optimised, often vectorised, for their purpose. We don't want to lose the 
> efficiency we've already got.
> 
> And I'm not saying we shouldn't have UTF-8 algorithms or even a 
> QUtf8StringView or some such. It would have helped in CBOR, for example, see 
> QCborStreamWriter:
>void appendTextString(const char *utf8, qsizetype len);
> 
> This is one that should at least get the overload.
> 
> -- 
> Thiago Macieira - thiago.macieira (AT) intel.com
>  Software Architect - Intel System Software Products


As I understood the template suggestion, it’s more about not having to add 64 
different overloads (or several more string classes) to the Qt API, and less 
about unifying all implementations into a single set of algorithms.

With an (ideally) single template-based API we don’t have people using Qt get 
lost in the jungle of overloads and string classes. For the implementation, we 
can specialise the templates to call the suitable internal functions that 
implement the various algorithms.

I don’t know or claim that this is feasible, but that’s how I have interpreted 
the suggestion for a template-based solution, and generally the (valid, IMHO) 
complaint that we have by now a ton of classes in Qt that solve almost the same 
problem, and require a significant cognitive effort to choose correctly from.

Cheers,
Volker



Re: [Development] char8_t summary?

2019-07-13 Thread Thiago Macieira
On Friday, 12 July 2019 17:37:59 -03 Matthew Woehlke wrote:
> That said, I took a look at startsWith, and... surprise! It is *already
> a template*. So at least in that case, it isn't obvious why adding more
> combinations would be so terribly onerous.

Again, note how the template implicitly assumes things. A 3-character string 
cannot be present at the beginning (startsWith), end (endsWith) or anywhere in 
the middle (contains, indexOf, lastIndexOf) of a 2-character one, for example.

But a 2- and 3-byte UTF-8 string can be the prefix of a 1-character UTF-16 
string and a 4-byte UTF-8 string can be the prefix of a 2-codeunit UTF-16 (1 
character). That means implementing UTF-8 functions requires different 
algorithms in the first place. That means templates are not usually the 
answer.
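
The arithmetic behind those numbers can be checked with a small sketch. The function names are illustrative only. The ranges are the standard UTF-8 and UTF-16 encoding boundaries.

```cpp
#include <cassert>

// Number of UTF-8 bytes needed to encode code point cp
// (assumes cp <= 0x10FFFF and cp is not a surrogate).
constexpr int utf8Length(char32_t cp)
{ return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4; }

// Number of UTF-16 code units needed for the same code point.
constexpr int utf16Length(char32_t cp)
{ return cp < 0x10000 ? 1 : 2; }

// U+00E9: 2 bytes in UTF-8, 1 unit in UTF-16.
static_assert(utf8Length(0x00E9) == 2 && utf16Length(0x00E9) == 1, "U+00E9");
// U+20AC: 3 bytes in UTF-8, 1 unit in UTF-16.
static_assert(utf8Length(0x20AC) == 3 && utf16Length(0x20AC) == 1, "U+20AC");
// U+1F600: 4 bytes in UTF-8, 2 units (a surrogate pair) in UTF-16.
static_assert(utf8Length(0x1F600) == 4 && utf16Length(0x1F600) == 2, "U+1F600");
```

So a check of the form "needle.size() > haystack.size() means no match" is wrong as soon as one side is UTF-8 and the other UTF-16, while it holds for Latin-1 against UTF-16.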

I'm not saying impossible. You can, by writing sufficiently generic algorithms 
that scan the strings in lockstep (you can scan UTF-8 backwards, after all). 
But the reason you don't *want* to is that our Latin1 and UTF-16 algorithms 
are optimised, often vectorised, for their purpose. We don't want to lose the 
efficiency we've already got.

And I'm not saying we shouldn't have UTF-8 algorithms or even a 
QUtf8StringView or some such. It would have helped in CBOR, for example, see 
QCborStreamWriter:
void appendTextString(const char *utf8, qsizetype len);

This is one that should at least get the overload.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] char8_t summary?

2019-07-13 Thread Thiago Macieira
On Friday, 12 July 2019 12:27:58 -03 Matthew Woehlke wrote:
> > And if we want to make use of the fact that a string
> > is UTF-8, the templates won't work.
> 
> Eh? char8_t is a detectable and distinct type. (Wasn't that the whole
> point of this thread?) So is QUtf8String if such a thing were to come
> into existence.

I didn't mean we can't write templates.

I meant that at the end of the implementation, you've got two distinct 
functions: one for Latin1/US-ASCII* and one for UTF-8, whether you used 
templates or not. So the template didn't buy you much.

[*] US-ASCII under "out of range characters are UB", which allows us to simply 
use Latin1. Or UTF-8.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] char8_t summary?

2019-07-13 Thread Mutz, Marc via Development

On 2019-07-12 22:37, Matthew Woehlke wrote:
> [...]
> So, perhaps you should suggest a more specific example?


I did: replace and relational operators. And you're right to look at 
startsWith(), because that is indeed binary (*this being the first 
argument). And it's also one which is thoroughly view-enabled. But this 
just means that my replace() math was wrong: it's not binary, it's 
ternary (*this, before, after) and that means not 16 vs. 25 overloads, 
but 64 vs. 125 overloads. And that _is_ with views enabled (as per 
QtPrivate::startsWith(); QChar arguments are handled one level up, and 
converted to a QStringView argument).


And speaking about startsWith(): if you drill down through the 
templates, you will end up in qt_compare_strings, which is not templated, 
and even if it could be today, which would be rather pointless, you just 
drill one more level down and end up in ucstrncmp etc, which are oh so 
far away from ever being templates...


So, as you can see, we're already using templates where it makes sense, 
but at some point you do need to go into the gritty details, and then 
it's assembler, not templates.


Thanks,
Marc


Re: [Development] char8_t summary?

2019-07-12 Thread Matthew Woehlke
On 12/07/2019 16.05, Mutz, Marc via Development wrote:
> On 2019-07-12 17:27, Matthew Woehlke wrote:
>> On 11/07/2019 15.01, Thiago Macieira wrote:
> [...]
>>> Except that the whole point of those methods is that they can be more
>>> efficient when the encoding is known and therefore templating won't
>>> help.
>>
>> So those cases can employ specializations. Or, perhaps better, wrap the
>> implementation bits where it matters in `if constexpr`.
> 
> You should, maybe, take a look at qstring.cpp before you make such
> uninformed statements.

I was thinking in terms of what I would do if I was implementing things
from scratch; not how I would refactor existing code.

That said, I took a look at startsWith, and... surprise! It is *already
a template*. So at least in that case, it isn't obvious why adding more
combinations would be so terribly onerous.

For that matter, making it a template (with explicit extern
instantiations) would already be an improvement since it would cut down
the several extant definitions into one definition and some declarations
(which could even be enumerated by macro magic).

So, perhaps you should suggest a more specific example?

-- 
Matthew


Re: [Development] char8_t summary?

2019-07-12 Thread Mutz, Marc via Development

On 2019-07-12 17:27, Matthew Woehlke wrote:
> On 11/07/2019 15.01, Thiago Macieira wrote:
> [...]
>> Except that the whole point of those methods is that they can be more
>> efficient when the encoding is known and therefore templating won't
>> help.
>
> So those cases can employ specializations. Or, perhaps better, wrap the
> implementation bits where it matters in `if constexpr`.


You should, maybe, take a look at qstring.cpp before you make such 
uninformed statements.


When you do, keep in mind that these 12k5loc do not even contain direct 
(as in zerocopy) utf-8/l1 and utf-8/utf16 comparisons, yet. Optimizing 
those is what earns you a slot at CppCon. Well, not anymore, that ship 
has sailed.


Thanks,
Marc


Re: [Development] char8_t summary?

2019-07-12 Thread Matthew Woehlke
On 11/07/2019 15.01, Thiago Macieira wrote:
> On Thursday, 11 July 2019 13:41:49 -03 Matthew Woehlke wrote:
>> On 11/07/2019 05.05, Mutz, Marc via Development wrote:
>>> There is a cost associated with another string class, too, and it's
>>> combinatorial explosion. Even when we have all view types
>>> (QLatin1StringView, QUtf8StringView, QStringView), consider the overload
>>> set of QString::replace(), ignoring the (ptr, size) variants:
>>>
>>>{QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}
>>>
>>> that's 16 overloads. And that's without a possible QUtf32StringView.
>>
>> So?
>>
>> The right way to handle this is for those methods to be templated, in
>> which case a) the code only needs to be written O(1) times, not O(N)
>> times, and b) users can potentially specialize for their own string
>> types as well.
> 
> Except that the whole point of those methods is that they can be more 
> efficient when the encoding is known and therefore templating won't help. 

So those cases can employ specializations. Or, perhaps better, wrap the
implementation bits where it matters in `if constexpr`.

> Templating won't make overload resolution any faster, but will make 
> compilation times slower.

For Qt, yes. This could be significantly (entirely?) mitigated with
explicit, external instantiations, such that only the one source in Qt
itself that compiles the instantiations is significantly affected.

> And if we want to make use of the fact that a string 
> is UTF-8, the templates won't work.

Eh? char8_t is a detectable and distinct type. (Wasn't that the whole
point of this thread?) So is QUtf8String if such a thing were to come
into existence.

>> If done cleverly, even the (pointer, size) variants should be able to
>> wrap the arguments in a View, such that those method definitions are
>> trivial.
> 
> View = (pointer,size) pair.

I meant that e.g. it would not be hard to make:

  foo(CharType const* s, SizeType L)

...be a simple wrapper around:

  foo(View::type s);

...which is itself either a template (per above), or several
non-template functions taking various types of views (status quo). No
combinatorial explosion of code per possible pointer type.
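
As a sketch of what that could look like, with std::basic_string_view standing in for the Qt view types and countSpaces() as a made-up operation chosen only to keep the example short:

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// The "real" function, written once against a view.
template <typename Char>
std::size_t countSpaces(std::basic_string_view<Char> s)
{
    std::size_t n = 0;
    for (Char c : s)
        n += (c == Char(' '));
    return n;
}

// The (pointer, size) variant is a one-line wrapper that builds the view,
// whatever the character type is.
template <typename Char>
std::size_t countSpaces(const Char *s, std::size_t len)
{ return countSpaces(std::basic_string_view<Char>(s, len)); }
```

The wrapper adds no per-type code: countSpaces(u"a b c", 5) and countSpaces("a b", 2) both go through the same two definitions.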

-- 
Matthew


Re: [Development] char8_t summary?

2019-07-12 Thread Thiago Macieira
On Thursday, 11 July 2019 13:41:49 -03 Matthew Woehlke wrote:
> On 11/07/2019 05.05, Mutz, Marc via Development wrote:
> > There is a cost associated with another string class, too, and it's
> > combinatorial explosion. Even when we have all view types
> > (QLatin1StringView, QUtf8StringView, QStringView), consider the overload
> > set of QString::replace(), ignoring the (ptr, size) variants:
> > 
> >{QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}
> > 
> > that's 16 overloads. And that's without a possible QUtf32StringView.
> 
> So?
> 
> The right way to handle this is for those methods to be templated, in
> which case a) the code only needs to be written O(1) times, not O(N)
> times, and b) users can potentially specialize for their own string
> types as well.

Except that the whole point of those methods is that they can be more 
efficient when the encoding is known and therefore templating won't help. 
Templating won't make overload resolution any faster, but will make 
compilation times slower. And if we want to make use of the fact that a string 
is UTF-8, the templates won't work.

Right now, we know bytelength(latin1string) == codepointlength(utf16string), 
so we know how to efficiently replace and we apply that knowledge to indexOf, 
startsWith, endsWith, etc.. That's not the case for UTF-8, so algorithms will 
begin to differ very quickly.

> If done cleverly, even the (pointer, size) variants should be able to
> wrap the arguments in a View, such that those method definitions are
> trivial.

View = (pointer,size) pair.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] char8_t summary?

2019-07-12 Thread Bernhard Lindner

> please, if it can be avoided, don't add yet another string-related class to 
> Qt. Knowing
> when to properly use QString, QByteArray, QLatin1String, QStringLiteral, 
> QStringRef and
> QStringView (I may have missed a few) is already a challenge. And I imagine 
> for people
> new to Qt it can even be a strong deterrent (after all, strings are something 
> you tend
> to use even in a simple Hello World - the first app most people see or write 
> in a new
> language/ framework).

I totally agree.

Maybe this helps (I could not find such a document):
https://bugreports.qt.io/browse/QTBUG-77020

-- 
Best Regards,
Bernhard Lindner



Re: [Development] char8_t summary?

2019-07-11 Thread Tomasz Siekierda
On Thu, 11 Jul 2019 at 18:43, Matthew Woehlke 
wrote:

> On 11/07/2019 05.05, Mutz, Marc via Development wrote:
> > There is a cost associated with another string class, too, and it's
> > combinatorial explosion. Even when we have all view types
> > (QLatin1StringView, QUtf8StringView, QStringView), consider the overload
> > set of QString::replace(), ignoring the (ptr, size) variants:
> >
> >{QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}
> >
> > that's 16 overloads. And that's without a possible QUtf32StringView.
>
> So?
>
>
I have nothing to say in this discussion, but just want to throw in one
small hint/request/worry:

please, if it can be avoided, don't add yet another string-related class to
Qt. Knowing when to properly use QString, QByteArray, QLatin1String,
QStringLiteral, QStringRef and QStringView (I may have missed a few) is
already a challenge. And I imagine for people new to Qt it can even be a
strong deterrent (after all, strings are something you tend to use even in
a simple Hello World - the first app most people see or write in a new
language/ framework).


> The right way to handle this is for those methods to be templated, in
> which case a) the code only needs to be written O(1) times, not O(N)
> times, and b) users can potentially specialize for their own string
> types as well.
>
> If done cleverly, even the (pointer, size) variants should be able to
> wrap the arguments in a View, such that those method definitions are
> trivial.
>
> --
> Matthew


Re: [Development] char8_t summary?

2019-07-11 Thread Matthew Woehlke
On 11/07/2019 05.05, Mutz, Marc via Development wrote:
> There is a cost associated with another string class, too, and it's
> combinatorial explosion. Even when we have all view types
> (QLatin1StringView, QUtf8StringView, QStringView), consider the overload
> set of QString::replace(), ignoring the (ptr, size) variants:
> 
>    {QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}
> 
> that's 16 overloads. And that's without a possible QUtf32StringView.

So?

The right way to handle this is for those methods to be templated, in
which case a) the code only needs to be written O(1) times, not O(N)
times, and b) users can potentially specialize for their own string
types as well.

If done cleverly, even the (pointer, size) variants should be able to
wrap the arguments in a View, such that those method definitions are
trivial.
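
As a rough illustration of this proposal (a sketch only, not Qt's implementation): std::string_view and std::u16string_view stand in for QLatin1StringView and QStringView, and the names unit, containsImpl and contains are hypothetical. The search is written once as a template, and the (pointer, size) variant merely wraps its arguments in a view:

```cpp
#include <cstddef>
#include <string_view>

// Widen one code unit to a comparable value. Latin-1 bytes go through
// unsigned char so that values above 0x7F are not sign-extended.
constexpr char32_t unit(char c) { return static_cast<unsigned char>(c); }
constexpr char32_t unit(char16_t c) { return c; }

// Single definition shared by every combination of the two view types.
template <typename Haystack, typename Needle>
constexpr bool containsImpl(Haystack haystack, Needle needle)
{
    if (needle.size() > haystack.size())
        return false;
    for (std::size_t i = 0; i + needle.size() <= haystack.size(); ++i) {
        std::size_t j = 0;
        while (j < needle.size() && unit(haystack[i + j]) == unit(needle[j]))
            ++j;
        if (j == needle.size())
            return true;
    }
    return false;
}

// The (pointer, size) variant only wraps its arguments in a view.
constexpr bool contains(std::u16string_view haystack, const char *latin1, std::size_t size)
{
    return containsImpl(haystack, std::string_view(latin1, size));
}
```

The per-unit comparison works here because Latin-1 and UTF-16 code units below U+0100 have the same values. A variable-width encoding such as UTF-8 would not fit this loop unchanged.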

-- 
Matthew


Re: [Development] char8_t summary?

2019-07-11 Thread Mutz, Marc via Development

On 2019-07-11 10:13, André Pönitz wrote:
> On Wed, Jul 10, 2019 at 10:01:04PM -0300, Thiago Macieira wrote:
> > On Wednesday, 10 July 2019 09:55:02 -03 André Pönitz wrote:
> > > As far as I understand there's a perceived need to have "full" utf8
> > > literals, and there's a need to have ASCII literals. First could be
> > > served by some QUtf8*, second by QAscii*, both additions, no need to
> > > change QLatin* semantics.
> >
> > ASCII = Latin1
>
> bool = char ?
>
> circle = ellipse ?
>
> It's a subset, it is special enough to be called by its name. Especially
> if it has features (e.g. toUpper/toLower operating on single letters) that
> are not present in the larger set.
>
> The line of discussion here is
>
>   - people (correctly, happily) use toUpper on (7-bit clean US-ASCII) data
>   - ASCII is claimed to be identical to Latin1
>   - since it is identical it is superfluous to have both and ASCII is dropped
>   - toUpper does not work per-char for Latin1 in corner cases
>   - so it needs to be dropped "to avoid wrong use"

There is a cost associated with another string class, too, and it's 
combinatorial explosion. Even when we have all view types 
(QLatin1StringView, QUtf8StringView, QStringView), consider the overload 
set of QString::replace(), ignoring the (ptr, size) variants:


   {QL1V, QU8V, QSV, QChar} x {QL1V, QU8V, QSV, QChar}

that's 16 overloads. And that's without a possible QUtf32StringView. 
Ditto for the relational operators. Add QAsciiStringView and you're up 
to 25. Mind you, this is the math for the end game: no more const char*, 
const char8_t*, and (ptr, size) overloads as they've all been subsumed 
by their corresponding views. We'll be there, maybe, come Qt 7. The math 
is even worse until then.


> In the end this deprives users of a useful tool in a scenario where it
> was perfectly fine to use.


I don't see how. Users will be able to use QU8V or QL1V's toUpper() and 
they'll just work for US-ASCII. The L1 algorithm can be coded such that 
only ß and \xFF are on a slow path. Or maybe it's the case that 
toUpper() doesn't extend the length of UTF-8-encoded text? Maybe we're 
lucky and Unicode finally gets that the capital letter ß isn't SS, but 
ẞ, and we can then just document that if the capital letter isn't 
representable in L1, then it stays unchanged.
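
A minimal sketch of such an L1 algorithm, assuming the "stays unchanged" policy described above. This is plain C++ rather than Qt code, and the names latin1ToUpper and latin1ToUpperInPlace are hypothetical:

```cpp
#include <string>

// Upper-cases one Latin-1 byte when the result is itself a Latin-1 character;
// otherwise returns the byte unchanged.
constexpr unsigned char latin1ToUpper(unsigned char c)
{
    if (c >= 'a' && c <= 'z')
        return static_cast<unsigned char>(c - 0x20); // ASCII fast path
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)         // 0xE0..0xFE, skipping the division sign 0xF7
        return static_cast<unsigned char>(c - 0x20); // maps to 0xC0..0xDE
    // 0xDF (sharp s), 0xFF (y with diaeresis) and 0xB5 (micro sign) have no
    // single-character Latin-1 upper-case form; this sketch leaves them as-is.
    return c;
}

// In-place conversion: the length never changes under this policy.
inline void latin1ToUpperInPlace(std::string &latin1)
{
    for (char &ch : latin1)
        ch = static_cast<char>(latin1ToUpper(static_cast<unsigned char>(ch)));
}
```

Besides ß and \xFF, the micro sign (0xB5) also falls through unchanged, because its Unicode upper-case form is U+039C, which is outside L1.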


I'm still not convinced that QAsciiString is needed for any of this.

Thanks,
Marc


Re: [Development] char8_t summary?

2019-07-11 Thread André Pönitz
On Wed, Jul 10, 2019 at 10:01:04PM -0300, Thiago Macieira wrote:
> On Wednesday, 10 July 2019 09:55:02 -03 André Pönitz wrote:
> > As far as I understand there's a perceived need to have "full" utf8
> > literals, and there's a need to have ASCII literals. First could be
> > served by some QUtf8*, second by QAscii*, both additions, no need to
> > change QLatin* semantics.
> 
> ASCII = Latin1

bool = char ?

circle = ellipse ?

It's a subset, it is special enough to be called by its name. Especially
if it has features (e.g. toUpper/toLower operating on single letters) that
are not present in the larger set.

The line of discussion here is 

  - people (correctly, happily) use toUpper on (7-bit clean US-ASCII) data
  - ASCII is claimed to be identical to Latin1
  - since it is identical it is superfluous to have both and ASCII is dropped
  - toUpper does not work per-char for Latin1 in corner cases
  - so it needs to be dropped "to avoid wrong use"

In the end this deprives users of a useful tool in a scenario where it
was perfectly fine to use.

Andre'


Re: [Development] char8_t summary?

2019-07-10 Thread Thiago Macieira
On Wednesday, 10 July 2019 22:01:04 -03 Thiago Macieira wrote:
> On Wednesday, 10 July 2019 09:55:02 -03 André Pönitz wrote:
> > As far as I understand there's a perceived need to have "full" utf8
> > literals, and there's a need to have ASCII literals. First could be
> > served by some QUtf8*, second by QAscii*, both additions, no need to
> > change QLatin* semantics.
> 
> ASCII = Latin1

In the sense that the class holding ASCII should be the Latin1 class, for the 
reasons that Marc presented. It's actually faster to convert from Latin1 to 
UTF-16 than from US-ASCII to UTF-16 (unless we declare out-of-bounds US-ASCII 
UB).

The only issue is what to do with the transforming functions toUpper and 
toLower.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] char8_t summary?

2019-07-10 Thread Thiago Macieira
On Wednesday, 10 July 2019 09:55:02 -03 André Pönitz wrote:
> As far as I understand there's a perceived need to have "full" utf8
> literals, and there's a need to have ASCII literals. First could be
> served by some QUtf8*, second by QAscii*, both additions, no need to
> change QLatin* semantics.

ASCII = Latin1

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel System Software Products





Re: [Development] char8_t summary?

2019-07-10 Thread Matthew Woehlke
On 10/07/2019 09.10, Mutz, Marc via Development wrote:
> The other reason is about error checking: What should the result be of
> putting an æ into a QAsciiString? Assert at runtime? UB? In
> QLatin1String, this error just can't happen. Even if you feed it UTF-8,
> you may get mojibake, because you picked the wrong encoding, but it's
> not an error. Any UTF-8 octet sequence is a valid L1 string.
> 
> So, I don't see QAscii* pulling its weight.

The reason ASCII might be helpful is that it guarantees certain
transformations (e.g. case conversion) in-place. L1 can't do this; the
L1 upper-case of U+00DF ('ß') is "SS". U+00FF ('ÿ') is in a similar
boat; I'm not sure it *has* an L1 upper-case. (The "proper" upper-case
is, I presume, U+0178, which is not in L1.)

Also, conversion from ASCII to either L1 or UTF-8 is a no-op. (ASCII to
UTF-16 can also be done with strict widening, but that's true for L1 also.)

-- 
Matthew


Re: [Development] char8_t summary?

2019-07-10 Thread Mutz, Marc via Development

On 2019-07-10 14:55, André Pönitz wrote:
> On Wed, Jul 10, 2019 at 11:29:15AM +0200, Mutz, Marc via Development wrote:
> > On 2019-07-10 10:50, Arnaud Clere wrote:
> > > Hi all,
> > >
> > > So, do I understand correctly that:
> > > 1. QUtf8String may be required in Qt7 to solve problems due to C++2x
> > > char8_t
> >
> > I wouldn't say required. I also don't think it needs to wait until Qt 7. Qt
> > 7 is where we may depend on C++20 and can use char8_t in the interface and
> > implementation, but we should certainly not wait for that to add the class.
> > It's certainly a good idea, IMO, to have views and owning containers that
> > operate on L1, UTF-8 and UTF-16 strings. The views are more important.
> >
> > > 2. QByteArray methods currently operating on latin1 may be restricted
> > > to ascii in Qt6 to avoid problems when const char* input really is
> > > utf8
> >
> > I have no opinion on that.
> >
> > > 3. QLatin1String may become QLatin1StringView by Qt7
> >
> > Qt 6. We can add the name as an alias now, make QLatin1String an owning
> > container for Qt 6.0 (it breaks no code, just makes it slower, and the port
> > is trivial), and QLatin1StringView becomes what QLatin1String is now.
>
> As far as I understand there's a perceived need to have "full" utf8
> literals, and there's a need to have ASCII literals. First could be
> served by some QUtf8*, second by QAscii*, both additions, no need to
> change QLatin* semantics.


L1 is special because it's the first 256 code points of Unicode, so 
conversion between the two will always be faster than between other 
encodings. This is why it makes sense to use all 8 bits and have L1, not 
artificially restrict to US-ASCII strings. That's one reason: opportunism.


The other reason is about error checking: What should the result be of 
putting an æ into a QAsciiString? Assert at runtime? UB? In 
QLatin1String, this error just can't happen. Even if you feed it UTF-8, 
you may get mojibake, because you picked the wrong encoding, but it's 
not an error. Any UTF-8 octet sequence is a valid L1 string.
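
The difference in error handling can be sketched as follows. This is plain C++ rather than Qt code; decodeLatin1 and isUsAscii are hypothetical helper names, and std::u16string stands in for QString:

```cpp
#include <algorithm>
#include <string>
#include <string_view>

// Decoding Latin-1 is a total function: each byte 0x00..0xFF is the code
// point of the same value, so there is no invalid input.
inline std::u16string decodeLatin1(std::string_view bytes)
{
    std::u16string out;
    out.reserve(bytes.size());
    for (char c : bytes)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    return out;
}

// A US-ASCII type would additionally have to decide what happens when this
// check fails (assert, report an error, or declare it undefined behaviour).
inline bool isUsAscii(std::string_view bytes)
{
    return std::all_of(bytes.begin(), bytes.end(),
                       [](char c) { return static_cast<unsigned char>(c) < 0x80; });
}
```

Feeding the UTF-8 encoding of æ (bytes 0xC3 0xA6) to decodeLatin1 yields the two characters U+00C3 and U+00A6: mojibake, but a well-defined result. The same bytes make isUsAscii return false, and the caller then has to choose a policy.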


So, I don't see QAscii* pulling its weight.

Thanks,
Marc


Re: [Development] char8_t summary?

2019-07-10 Thread André Pönitz
On Wed, Jul 10, 2019 at 11:29:15AM +0200, Mutz, Marc via Development wrote:
> On 2019-07-10 10:50, Arnaud Clere wrote:
> > Hi all,
> > 
> > So, do I understand correctly that:
> > 1. QUtf8String may be required in Qt7 to solve problems due to C++2x
> > char8_t
> 
> I wouldn't say required. I also don't think it needs to wait until Qt 7. Qt
> 7 is where we may depend on C++20 and can use char8_t in the interface and
> implementation, but we should certainly not wait for that to add the class.
> It's certainly a good idea, IMO, to have views and owning containers that
> operate on L1, UTF-8 and UTF-16 strings. The views are more important.
> 
> > 2. QByteArray methods currently operating on latin1 may be restricted
> > to ascii in Qt6 to avoid problems when const char* input really is
> > utf8
> 
> I have no opinion on that.
> 
> > 3. QLatin1String may become QLatin1StringView by Qt7
> 
> Qt 6. We can add the name as an alias now, make QLatin1String an owning
> container for Qt 6.0 (it breaks no code, just makes it slower, and the port
> is trivial), and QLatin1StringView becomes what QLatin1String is now.

As far as I understand there's a perceived need to have "full" utf8
literals, and there's a need to have ASCII literals. First could be
served by some QUtf8*, second by QAscii*, both additions, no need to
change QLatin* semantics.

Andre'


Re: [Development] char8_t summary?

2019-07-10 Thread Mutz, Marc via Development

On 2019-07-10 10:50, Arnaud Clere wrote:
> Hi all,
>
> So, do I understand correctly that:
> 1. QUtf8String may be required in Qt7 to solve problems due to C++2x
> char8_t


I wouldn't say required. I also don't think it needs to wait until Qt 7. 
Qt 7 is where we may depend on C++20 and can use char8_t in the 
interface and implementation, but we should certainly not wait for that 
to add the class. It's certainly a good idea, IMO, to have views and 
owning containers that operate on L1, UTF-8 and UTF-16 strings. The 
views are more important.



> 2. QByteArray methods currently operating on latin1 may be restricted
> to ascii in Qt6 to avoid problems when const char* input really is
> utf8


I have no opinion on that.


> 3. QLatin1String may become QLatin1StringView by Qt7


Qt 6. We can add the name as an alias now, make QLatin1String an owning 
container for Qt 6.0 (it breaks no code, just makes it slower, and the 
port is trivial), and QLatin1StringView becomes what QLatin1String is 
now.


> 4. These classes will be independent except maybe for a common internal
> class


Yes. Or separate instantiations of the same class template. They also 
should convert to QByteArray. Just not by public inheritance.
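
One possible shape for "separate instantiations of the same class template", as a sketch only: std::string stands in for QByteArray, and Latin1Tag, Utf8Tag, BasicByteString and toByteArray are hypothetical names, not existing Qt API.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

struct Latin1Tag {};
struct Utf8Tag {};

// One implementation, instantiated once per encoding. The byte storage is a
// private member, so there is no is-a relationship with the byte container.
template <typename Encoding>
class BasicByteString
{
public:
    explicit BasicByteString(std::string bytes) : m_bytes(std::move(bytes)) {}

    std::size_t size() const { return m_bytes.size(); }

    // Conversion to raw bytes is an explicit call, not public inheritance.
    const std::string &toByteArray() const { return m_bytes; }

private:
    std::string m_bytes;
};

using Latin1String = BasicByteString<Latin1Tag>;
using Utf8String = BasicByteString<Utf8Tag>;

// The instantiations are unrelated types: no accidental mixing of encodings.
static_assert(!std::is_same_v<Latin1String, Utf8String>);
static_assert(!std::is_convertible_v<Latin1String, Utf8String>);
static_assert(!std::is_base_of_v<std::string, Latin1String>);
```

Because the tag only distinguishes the types, passing a Latin1String where a Utf8String is expected fails to compile instead of silently reinterpreting the bytes.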


Thanks,
Marc


Re: [Development] char8_t summary?

2019-07-10 Thread Arnaud Clere
Hi all,

So, do I understand correctly that:
1. QUtf8String may be required in Qt7 to solve problems due to C++2x char8_t
2. QByteArray methods currently operating on latin1 may be restricted to ascii 
in Qt6 to avoid problems when const char* input really is utf8
3. QLatin1String may become QLatin1StringView by Qt7
4. These classes will be independent except maybe for a common internal class

Arnaud