[Development] Renaming quint128

2022-11-17 Thread Thiago Macieira
I was working on extended integers and added qint128 and quint128 to qglobal.h 
(qtypes.h), but when I tried to rebuild all of Qt today, I found out that 
QtBluetooth has this in qbluetoothuuid.h:

struct quint128
{
    quint8 data[16];
};

And it's used in the API, with a constructor and a toUInt128(), but that's 
all. It's also not documented.

I'd like to move it out of the way so I can add the proper integer type.

There's a way to replace it without breaking BC or SC:
1) on 64-bit systems with GCC and Clang, use the actual integer type
2) everywhere else, use the struct
3) for QtBluetooth's own build, add a removed_api.cpp that also #undefs 
__SIZEOF_INT128__
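
A minimal sketch of (1) and (2), purely illustrative (the exact guards and 
where the fallback ends up living are still open):

#if defined(__SIZEOF_INT128__)
// GCC and Clang on 64-bit targets: use the compiler's 128-bit integer.
using qint128 = __int128;
using quint128 = unsigned __int128;
#else
// Everywhere else: keep the plain struct that QtBluetooth has today.
struct quint128
{
    quint8 data[16];
};
#endif
// For (3), QtBluetooth's removed_api.cpp would #undef __SIZEOF_INT128__ before
// including the headers, so the old struct-based API keeps being compiled.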

It might be a good idea to move that backup definition to QtCore, so 
QtBluetooth doesn't depend on exactly how qtypes.h does it.
-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering





Re: [Development] How qAsConst and qExchange lead to qNN

2022-11-17 Thread Thiago Macieira
On Thursday, 17 November 2022 10:32:50 PST Elvis Stansvik wrote:
> Fermat's Last QString Vectorization Update :p

Everything is already sent to Gerrit. What I haven't done is benchmark it to 
confirm the theoretical runs from LLVM-MCA.

It starts at
https://codereview.qt-project.org/c/qt/qtbase/+/386952

See the search at
https://codereview.qt-project.org/q/is:open+owner:thiago.macieira%2540intel.com+message:QString

The changes are mostly organised as "reorganise the pre-AVX code", then 
"rewrite the AVX2 code", then "add AVX512VL code" for each of the functions.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering





Re: [Development] How qAsConst and qExchange lead to qNN

2022-11-17 Thread Thiago Macieira
On Thursday, 17 November 2022 10:24:35 PST Volker Hilsheimer via Development 
wrote:
> > Though I am postponing the QString vectorisation update to 6.6 because I
> > don't have time to collect the benchmarks to prove I'm right before
> > feature freeze next Friday.
> 
> Next Friday is the platform & module freeze. Feature freeze is not until
> December 9th, i.e. another 3 weeks to go.

Next Friday is also the day after Thanksgiving here in the US.

I don't expect I can finish the benchmarking in 3 weeks, and that's before 
considering that I need to finish the IPC work, which includes a couple of 
changes I haven't even started yet (like the ability to clean up after itself).

For the benchmarking, I've already collected the data by instrumenting each of 
the functions in question and running a Qt build, a Qt Creator start and a Qt 
build inside Qt Creator:

qt-build-data.tar.xz: 1197.3 MB
qtcreator-nosession.tar.xz: 2690.0 MB
qtcreator-session.tar.xz: 35134.6 MB

The data retains its intra-cacheline alignment.
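
Purely as an illustration of what each record could contain (this is not the 
actual harness): the string length plus the pointer's offset within its 
64-byte cache line are enough for a replay to preserve that alignment.

#include <QtGlobal>
#include <cstdio>

static FILE *dataFile;   // opened elsewhere

// Record one call of the instrumented function: length plus the pointer's
// offset inside its cache line.
static void recordCall(const char16_t *ptr, qsizetype len)
{
    std::fprintf(dataFile, "%lld %u\n", qlonglong(len), unsigned(quintptr(ptr) & 63));
}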

The way I see it, for each of the algorithm generations I need to:
1) find the asymptotic limits, given L1, L2 and L3 cache sizes
That is, the algorithms should be fast enough that the bottleneck is the 
transfer of data. There's no way that running qustrchr on 35 GB is going to be 
bound by anything other than RAM bandwidth or, in my laptop's case, the NVMe. 
So what are those limits?

2) benchmark at several data set sizes (half to 75% of L1, half to 75% of L2) 
on several processor generations
Confirm that the algorithm runs close to or better than the ideal run that 
LLVM-MCA showed when I designed them (a rough sketch of such a sweep follows 
this list). I know I can benchmark throughput to see whether we're reaching 
the target bytes-per-cycle processing rate, but I don't know if I can 
benchmark the latency. I also don't know if it matters.

3) benchmark at several input sizes (i.e., strings of 4 characters, 8 
characters, etc.)
Same as #2, but instead of running over a sample that adds up to a certain 
data size, select the input such that the strings always have the same size.

4) compare to the previous generation's algorithm to confirm it's actually 
better
Different instructions have different pros and cons; what works for one at a 
given data size may not work for another.
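
A rough sketch of the size sweep in step 2 (not the actual Qt benchmark: cache 
sizes are assumed rather than detected, and findChar() is just a stand-in for 
qustrchr and friends):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Stand-in for the function under test; implemented elsewhere.
const char16_t *findChar(const char16_t *str, std::size_t len, char16_t ch);

void sweep()
{
    constexpr std::size_t l1 = 32 * 1024, l2 = 1024 * 1024;  // assumed cache sizes, bytes
    for (std::size_t bytes : { l1 / 2, l1 * 3 / 4, l2 / 2, l2 * 3 / 4 }) {
        std::vector<char16_t> data(bytes / sizeof(char16_t), u'a');
        constexpr int reps = 1000;
        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < reps; ++i)
            findChar(data.data(), data.size(), u'b');         // never found: full scan
        const double secs = std::chrono::duration<double>(
                                std::chrono::steady_clock::now() - start).count();
        std::printf("%zu bytes: %.2f GB/s\n", bytes, bytes * reps / secs / 1e9);
    }
}

Measuring cycles instead of wall time in the same loop would give the 
bytes-per-cycle figure; latency is the part that's hard to get at, as noted.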

The algorithms available are:
* baseline SSE2: no comparisons
* SSE 4.1: compare to baseline SSE2, current SSE 4.1
* AVX2: compare to new SSE 4.1, current AVX2
* AVX512 with 256-bit vectors ("Avx256"): compare to new AVX2

I plan on collecting data on 3 laptop processors (Haswell, Skylake and Tiger 
Lake) and 2 desktop processors (Coffee Lake and Skylake Extreme). The Skylake 
should match the performance of almost all Skylake derivatives shipped since 
2016; the Coffee Lake NUC has the same processor as my Mac Mini; the Tiger 
Lake should represent the performance of modern processors. The Skylake 
Extreme and the Tiger Lake can also run the AVX512 code. I don't know whether 
the AVX512 code on Skylake will show a performance gain or a loss, because 
despite using only 256-bit vectors, it may need to power on the OpMask 
registers. If it turns out to be a loss, I will adjust the feature detection 
so it only applies to Ice Lake and later.
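
For illustration only (this is not Qt's actual qsimd detection, and a real 
"Ice Lake and later" cutoff would need a family/model check on top of the 
plain feature bits), a runtime gate could look roughly like this, using 
GCC/Clang's __builtin_cpu_supports; the compare* functions are stand-ins:

#include <cstddef>

using CompareFn = int (*)(const char16_t *, const char16_t *, std::size_t);

int compareSse2(const char16_t *, const char16_t *, std::size_t);    // baseline
int compareAvx2(const char16_t *, const char16_t *, std::size_t);
int compareAvx256(const char16_t *, const char16_t *, std::size_t);  // AVX512VL, 256-bit vectors

static CompareFn selectCompare()
{
    __builtin_cpu_init();
    // AVX512VL+BW provide the 256-bit EVEX forms; whether enabling them on
    // Skylake-class parts is a win is exactly what the benchmarks decide.
    if (__builtin_cpu_supports("avx512vl") && __builtin_cpu_supports("avx512bw"))
        return compareAvx256;
    if (__builtin_cpu_supports("avx2"))
        return compareAvx2;
    return compareSse2;
}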

I have a new Alder Lake which would be nice to benchmark, to get the 
performance on both the Golden Cove P-core and the Gracemont E-core, but the 
thing runs Windows and the IT-mandated virus scans, so I will not bother.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering





Re: [Development] How qAsConst and qExchange lead to qNN

2022-11-17 Thread Elvis Stansvik
On Thu, 17 Nov 2022 at 18:46, Thiago Macieira wrote:
>
> On Thursday, 17 November 2022 02:04:54 PST Marc Mutz via Development wrote:
> > > Also, sometimes I wonder if all the work you and I do to optimise these
> > > things matter, in the end. We may save 0.5% of the CPU time, only for
> > > that to be dwarfed by whatever QtGui, QtQml are doing.
> >
> > I hear you, but I'm not ready to give in just yet.
>
> Nor am I.
>
> Though I am postponing the QString vectorisation update to 6.6 because I don't
> have time to collect the benchmarks to prove I'm right before feature freeze
> next Friday.

Fermat's Last QString Vectorization Update :p

Elvis

>
> --
> Thiago Macieira - thiago.macieira (AT) intel.com
>   Cloud Software Architect - Intel DCAI Cloud Engineering


Re: [Development] How qAsConst and qExchange lead to qNN

2022-11-17 Thread Volker Hilsheimer via Development
> On 17 Nov 2022, at 18:45, Thiago Macieira wrote:
> 
> On Thursday, 17 November 2022 02:04:54 PST Marc Mutz via Development wrote:
>>> Also, sometimes I wonder if all the work you and I do to optimise these
>>> things matter, in the end. We may save 0.5% of the CPU time, only for
>>> that to be dwarfed by whatever QtGui, QtQml are doing.
>> 
>> I hear you, but I'm not ready to give in just yet.
> 
> Nor am I.
> 
> Though I am postponing the QString vectorisation update to 6.6 because I don't
> have time to collect the benchmarks to prove I'm right before feature freeze
> next Friday.


Next Friday is the platform & module freeze. Feature freeze is not until 
December 9th, i.e. another 3 weeks to go.

Volker



Re: [Development] How qAsConst and qExchange lead to qNN

2022-11-17 Thread Thiago Macieira
On Thursday, 17 November 2022 02:04:54 PST Marc Mutz via Development wrote:
> > Also, sometimes I wonder if all the work you and I do to optimise these
> > things matter, in the end. We may save 0.5% of the CPU time, only for
> > that to be dwarfed by whatever QtGui, QtQml are doing.
> 
> I hear you, but I'm not ready to give in just yet.

Nor am I.

Though I am postponing the QString vectorisation update to 6.6 because I don't 
have time to collect the benchmarks to prove I'm right before feature freeze 
next Friday.

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Cloud Software Architect - Intel DCAI Cloud Engineering





Re: [Development] How qAsConst and qExchange lead to qNN

2022-11-17 Thread Marc Mutz via Development
Hi Thiago,

On 16.11.22 18:50, Thiago Macieira wrote:
> On Tuesday, 15 November 2022 23:50:38 PST Marc Mutz via Development wrote:
>>> in a thread-safe manner (such that if something in
>>> the same thread or another thread-safely modifies that map, the original
>>> user isn't affected).
>>
>> The above isn't thread-safe, it isn't even re-entrant, in the same way
>> that iteration using iterators isn't. This is a known issue whenever you
>> hand out references, and it's nothing that violates our
>> const-is-thread-safe promise,
> 
> No, but it moves the responsibility for avoiding this problem to the user.
> 
> Right now, you can do:
>    for (auto elem : object.keyList()) {
>        operate(); // may recurse back into object and modify it
>    }
> 
> If you use a generator paradigm to return this key list, then the *user* must
> know that they must create a local container with the items to be generated
> and iterate over that. Performance-wise, this no different than if the Qt code
> created the container and returned it, but it has two drawbacks:
> 
> 1) the responsibility for knowing this

Not necessarily. E.g. when the co-routine implementation uses the equivalent 
of an indexed loop, it immunizes itself from changes to the container while 
it's suspended. It can also post a re-entrancy guard in the class' data, like 
we sometimes already do in event handlers and often do in slots, to at least 
detect and mitigate the issue.
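
A minimal sketch of what I mean, assuming C++23 std::generator (Object and 
m_keys are made up):

#include <cstddef>
#include <generator>   // C++23
#include <vector>

class Object
{
    std::vector<int> m_keys;   // hypothetical storage
public:
    std::generator<int> keys() const
    {
        // Indexing re-checks size() on every resume, so modifying the
        // container while the co-routine is suspended never leaves it with a
        // dangling iterator; it simply observes the container's new state.
        for (std::size_t i = 0; i < m_keys.size(); ++i)
            co_yield m_keys[i];
    }
};

// Usage: for (int key : object.keys()) { /* may modify object */ }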

This isn't different from emitting signals or calling virtual functions 
while iterating, and the solutions are the same, and, largely, if not 
completely, under the control of the co-routine implementation.

That said, it's not entirely clear to me how widespread such issues are. After 
all, the user of a generator sees the potentially re-entering code: it's in the 
function he's presently writing/analyzing, not hidden in the way signal/slot 
connections or even virtual functions hide the issue by having far-removed 
code cause the problem. So I don't know whether the benefits of lazy 
evaluation outweigh or are dwarfed by this issue.

> 2) if the Qt object already has a QList with this, then using a generator
> paradigm enforces the need of a deep copy, when implicit would have been
> cheaper

I hasten to interject here that the code you wrote above actually does 
deep-copy in that case (hidden detach in the for loop).
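
To spell that out (Object/keyList are made up; the second loop is the 
detach-free variant):

#include <QList>
#include <QString>

struct Object
{
    QList<QString> m_keys;
    QList<QString> keyList() const { return m_keys; }   // implicitly-shared copy
};

void iterate(const Object &object)
{
    // Detaches: the range-for uses the non-const begin()/end() of the shared
    // temporary, which forces a deep copy.
    for (auto elem : object.keyList())
        (void)elem;

    // No detach: name the copy and iterate it as const.
    const auto keys = object.keyList();
    for (const auto &elem : keys)
        (void)elem;
}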

Apart from that, we're circling back to the assumption that a class would hold 
or return a QList for the sake of QList. For holding, and also for returning, 
if one must return an owning container, a QVLA or otherwise SBO'd container 
would be more appropriate in many cases. The lack of such containers in Qt 
begets the use of QList in the first place. To get out of this tread-mill, one 
needs to look outside the Qt echo chamber, to std C++ (std::u16string, 
std::pmr), Folly (F14 (hash table), fbstring (SSO, CoW only for large 
strings)), Python (strings are QAnyString with SSO there), LLVM 
(llvm::SmallVector, StringRef, ArrayRef), Mozilla's JS strings (L1/UTF-16 
QAnyString, SSO). Then work backwards from these kinds of containers to how we 
can enable them in Qt.

>>> Because you pointed to QStringTokenizer and that implicitly-
>>> copies a QString.
>>
>> That's imprecise. QStringTokenizer extends rvalue lifetimes ("rvalue
>> pinning") so's to make this safe:
>>
>>  for (auto part : qTokenize(label->text(), u';'))
> 
> BTW, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2012r2.pdf is
> accepted for C++23 and moves the end of the temporaries' lifetimes to the end
> of the full for statement.

Hallelujah! Thanks, Nico!

> Though we still need to work with C++17 and 20 for a while.
> 
> Also, sometimes I wonder if all the work you and I do to optimise these things
> matter, in the end. We may save 0.5% of the CPU time, only for that to be
> dwarfed by whatever QtGui, QtQml are doing.

I hear you, but I'm not ready to give in just yet.

Thanks,
Marc

-- 
Marc Mutz 
Principal Software Engineer

The Qt Company
Erich-Thilo-Str. 10 12489
Berlin, Germany
www.qt.io

Geschäftsführer: Mika Pälsi, Juha Varelius, Jouni Lintunen
Sitz der Gesellschaft: Berlin,
Registergericht: Amtsgericht Charlottenburg,
HRB 144331 B
