Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?
On Tuesday, 6 February 2024 22:02:11 PST Marc Mutz via Development wrote: > As someone who argues that qHash("Hello"_L1) be restored to > qHash(u"Hello"_s) after this relation was broken somewhere in Qt 5, how > could I argue against it? :) It was commit ea8e48a6799cf742ea23f4a30dcfc38a4988fe56 in 5.3.0. commit ea8e48a6799cf742ea23f4a30dcfc38a4988fe56 Author: Thiago Macieira Date: Thu Dec 19 23:32:04 2013 -0800 Update the qHash function for strings to use the CRC32 instruction According to my profiling of Qt Creator, qHash and the SHA-1 calculation are the hottest spots remaining in QtCore. The current qHash function is not really vectorizable. We could come up with a different algorithm that is more SIMD-friendly, but since we have the CRC32 instruction that can read 32- and 64-bit entities, we're set. This commit also updates the benchmark for QHash and benchmarks both the hashing function itself and the QHash class. The updated benchmarks for the CRC32 on my machine shows that the hashing function is *always* improved, but the hashing isn't always. In particular, the current algorithm is better for the "numbers" case, for which the data sample differs in very few bits. The new code is 33% slower for that particular case. On average, the improvement (including the "numbers" case) is: compared to qHash only QHash Qt 5.0 function 2.54x1.06x Qt 4.x function 4.34x1.34x Java function2.71x1.11x Now (current = siphash & murmurhash; nonzero_curent = aeshash; we don't have the CRC32 algorithm any more): RESULT : tst_QHash::hashing_qt4():"numbers": 126,800.35305 CPU cycles per iteration, 4.66 GHz 470,120.00894 instructions per iteration, 3.708 instr/cycle RESULT : tst_QHash::hashing_qt50():"numbers": 83,198.10215 CPU cycles per iteration, 4.64 GHz 365,099.00604 instructions per iteration, 4.388 instr/cycle RESULT : tst_QHash::hashing_current():"numbers": 151,220.07727 CPU cycles per iteration, 4.66 GHz 615,149.01052 instructions per iteration, 4.068 instr/cycle RESULT : tst_QHash::hashing_nonzero_current():"numbers": 65,224.59934 CPU cycles per iteration, 4.45 GHz 235,073.00508 instructions per iteration, 3.604 instr/cycle So even the pathological "numbers" case is now faster by about 27%. With "typical" data ("paths-small"): qt4:23,848.939476 CPU cycles per iteration qt50: 11,681.268231 CPU cycles per iteration current:10,623.703491 CPU cycles per iteration aeshash:2,870.472204 CPU cycles per iteration The siphash/murmurhash implementation is slightly faster than the Qt 5.0 implementation and both are over twice as fast as the 4.x implementation. The aeshash is on average 3.7x faster than even those two. And with very long data ("longstrings-list"): qt4:397,145.21743 CPU cycles per iteration qt50: 225,105.45316 CPU cycles per iteration siphash:102,900.78968 CPU cycles per iteration aeshash:10,249.00974 CPU cycles per iteration That's 10x faster on my laptop, which has 256-bit AES. With just 128-bit AES, my Skylake workstation is doing about 3.5x better than siphash for this case. See more numbers in https://codereview.qt-project.org/c/qt/qtbase/+/537009 -- Thiago Macieira - thiago.macieira (AT) intel.com Cloud Software Architect - Intel DCAI Cloud Engineering smime.p7s Description: S/MIME cryptographic signature -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?
On 05.02.24 23:42, Thiago Macieira wrote: > On Monday, 5 February 2024 01:36:39 PST Marc Mutz via Development wrote: >> I've always understood it such that we as Qt must preserve the property >> that the hash for equal elements is equal within _one_ run of _one_ >> process. This means you can use the hash in I/O in any way. That's why >> we have qt_hash, which you _can_ (and do) use in I/O (but is private >> API, AFAIK). I never thought that a hash seed of zero would change that, >> but of course users may have come to depend on this (Hyrum's Law), so a >> [ChangeLog] would be in order. > > In commit e3f05981cbeb0b7721f960ef88effa77be2af5ce, I added this comment to > qHashBits: > // mix in the length as a secondary seed. For seed == 0, seed2 must be > // size, to match what we used to do prior to Qt 6.2. > > Which is why I am asking now, because making this change would go against that > comment. But there was no discussion in the change about whether this was > correct or not. It seems I just write it like that. > > However, that was qHashBits(). The change I'm talking about is > qHash(QLatin1StringView), specifically so it won't call qHashBits(). As someone who argues that qHash("Hello"_L1) be restored to qHash(u"Hello"_s) after this relation was broken somewhere in Qt 5, how could I argue against it? :) -- Marc Mutz Principal Software Engineer The Qt Company Erich-Thilo-Str. 10 12489 Berlin, Germany www.qt.io Geschäftsführer: Mika Pälsi, Juha Varelius, Jouni Lintunen Sitz der Gesellschaft: Berlin, Registergericht: Amtsgericht Charlottenburg, HRB 144331 B -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?
On 05.02.24 19:07, apoenitz wrote: > On Mon, Feb 05, 2024 at 09:36:39AM +, Marc Mutz via Development wrote: >> Hi, >> >> I've always understood it such that we as Qt must preserve the property >> that the hash for equal elements is equal within _one_ run of _one_ >> process. > > This would make debugging and testing of applications using QHash > less deterministic. > > There needs to ba some way to guarantee identical behavior of multiple runs > of the same binary. Setting the seed to 0 is currently such a way You're right, my comment was overly simplistic. The non-determinism should be in the seed, and nowhere else. So developers should be able to rely on hash functions yielding the same value if the same seed is given _and the same Qt version is used_. That doesn't mean that applications or developers can rely on seed = 0 yielding the same hash value across Qt releases (even patch ones). The end result is the same, though: we can - definitely change exported qHash() functions - definitely _not_ change inline qHash() functions - possibly _not_ change any other qHash() function (there shouldn't be any in this category, but historically, there were some). Hope that clears it up. Thanks, Marc -- Marc Mutz Principal Software Engineer The Qt Company Erich-Thilo-Str. 10 12489 Berlin, Germany www.qt.io Geschäftsführer: Mika Pälsi, Juha Varelius, Jouni Lintunen Sitz der Gesellschaft: Berlin, Registergericht: Amtsgericht Charlottenburg, HRB 144331 B -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?
On Monday, 5 February 2024 01:36:39 PST Marc Mutz via Development wrote: > I've always understood it such that we as Qt must preserve the property > that the hash for equal elements is equal within _one_ run of _one_ > process. This means you can use the hash in I/O in any way. That's why > we have qt_hash, which you _can_ (and do) use in I/O (but is private > API, AFAIK). I never thought that a hash seed of zero would change that, > but of course users may have come to depend on this (Hyrum's Law), so a > [ChangeLog] would be in order. In commit e3f05981cbeb0b7721f960ef88effa77be2af5ce, I added this comment to qHashBits: // mix in the length as a secondary seed. For seed == 0, seed2 must be // size, to match what we used to do prior to Qt 6.2. Which is why I am asking now, because making this change would go against that comment. But there was no discussion in the change about whether this was correct or not. It seems I just write it like that. However, that was qHashBits(). The change I'm talking about is qHash(QLatin1StringView), specifically so it won't call qHashBits(). -- Thiago Macieira - thiago.macieira (AT) intel.com Cloud Software Architect - Intel DCAI Cloud Engineering smime.p7s Description: S/MIME cryptographic signature -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?
Hi, I've always understood it such that we as Qt must preserve the property that the hash for equal elements is equal within _one_ run of _one_ process. This means you can use the hash in I/O in any way. That's why we have qt_hash, which you _can_ (and do) use in I/O (but is private API, AFAIK). I never thought that a hash seed of zero would change that, but of course users may have come to depend on this (Hyrum's Law), so a [ChangeLog] would be in order. From that follows that you can change the hash algorithm, provided the hash function is out-of-line exported, but you can't if it's inline (or, theoretically, non-exported, because some user may have copied the implementation into his application or wrote a conflicting one; which is why I'd suggest to =delete non-Qt-provided qHash functions: to prevent users from writing their own). HTH, Marc On 03.02.24 22:08, Thiago Macieira wrote: > Requirement: the qHash function in question is and has always been out-of- > line. We obviously can't change the hashing function of inline code, because > it may have been inlined. > > When the seed is non-zero, we make no guarantees on what the result will be. > The result is allowed to change between Qt versions and between machines, even > for the same seed. Strictly speaking, you're not supposed to pass replay a > seed, because some hash functions have a hidden seed you can't get or set. > > But what about a zero seed? It's what we call a "deterministic hashing". We > have changed the algorithms on .0 releases (in both 5.0 and 6.0 for QString). > > The reason I'm asking is that right now > > qHash(QLatin1StringView(str)) == qHash(QByteArrayView(str)) > > but it would be far more useful to have > > qHash(QLatin1StringView(str)) == qHash(QString(str)) > > I've made the change for the non-zero seeds, but one couldn't rely on the > above unless it applied for every seed.f > > -- Marc Mutz Principal Software Engineer The Qt Company Erich-Thilo-Str. 10 12489 Berlin, Germany www.qt.io Geschäftsführer: Mika Pälsi, Juha Varelius, Jouni Lintunen Sitz der Gesellschaft: Berlin, Registergericht: Amtsgericht Charlottenburg, HRB 144331 B -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?
On Sunday, 4 February 2024 06:12:18 PST Giuseppe D'Angelo via Development wrote: > Il 03/02/24 22:08, Thiago Macieira ha scritto: > > But what about a zero seed? It's what we call a "deterministic hashing". > > We > > have changed the algorithms on .0 releases (in both 5.0 and 6.0 for > > QString). > > I don't think it means "deterministic" in the sense that the output will > never change across Qt versions. It just means deterministic across > multiple runs of the very same application using the very same Qt version. Indeed but how would it know that it is the same Qt version? It might be running with a system Qt that got upgraded in the background, so a new start would change it. For that matter, it's far more likely that it's an application and its helper at the same time, but even then they can't guarantee they're using the same Qt. > When you disable QHash seeding, the choice of making the seed equal to 0 > is just "a choice"; we could make it 42 or 0xF0CACC1A, for what is > worth. qHash can still change its output at any time across Qt versions > and software should never ever rely on that. I'm not sure. -- Thiago Macieira - thiago.macieira (AT) intel.com Cloud Software Architect - Intel DCAI Cloud Engineering smime.p7s Description: S/MIME cryptographic signature -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
Re: [Development] What's our policy on changing the result of qHash(T, 0) between major releases?
Il 03/02/24 22:08, Thiago Macieira ha scritto: But what about a zero seed? It's what we call a "deterministic hashing". We have changed the algorithms on .0 releases (in both 5.0 and 6.0 for QString). I don't think it means "deterministic" in the sense that the output will never change across Qt versions. It just means deterministic across multiple runs of the very same application using the very same Qt version. When you disable QHash seeding, the choice of making the seed equal to 0 is just "a choice"; we could make it 42 or 0xF0CACC1A, for what is worth. qHash can still change its output at any time across Qt versions and software should never ever rely on that. My 2 c, -- Giuseppe D'Angelo | giuseppe.dang...@kdab.com | Senior Software Engineer KDAB (France) S.A.S., a KDAB Group company Tel. France +33 (0)4 90 84 08 53, http://www.kdab.com KDAB - Trusted Software Excellence smime.p7s Description: Firma crittografica S/MIME -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development
[Development] What's our policy on changing the result of qHash(T, 0) between major releases?
Requirement: the qHash function in question is and has always been out-of- line. We obviously can't change the hashing function of inline code, because it may have been inlined. When the seed is non-zero, we make no guarantees on what the result will be. The result is allowed to change between Qt versions and between machines, even for the same seed. Strictly speaking, you're not supposed to pass replay a seed, because some hash functions have a hidden seed you can't get or set. But what about a zero seed? It's what we call a "deterministic hashing". We have changed the algorithms on .0 releases (in both 5.0 and 6.0 for QString). The reason I'm asking is that right now qHash(QLatin1StringView(str)) == qHash(QByteArrayView(str)) but it would be far more useful to have qHash(QLatin1StringView(str)) == qHash(QString(str)) I've made the change for the non-zero seeds, but one couldn't rely on the above unless it applied for every seed.f -- Thiago Macieira - thiago.macieira (AT) intel.com Cloud Software Architect - Intel DCAI Cloud Engineering smime.p7s Description: S/MIME cryptographic signature -- Development mailing list Development@qt-project.org https://lists.qt-project.org/listinfo/development