Revision: 291027
Author: rmoris...@apple.com
Date: 2022-03-08 18:56:37 -0800 (Tue, 08 Mar 2022)
Log Message
[WTF] LikelyDenseUnsignedIntegerSet::add can cause a reindexing of the entire bit vector with every call in the worst case
https://bugs.webkit.org/show_bug.cgi?id=236997
Reviewed by Saam Barati.
Source/_javascript_Core:
Just make it a little bit easier to change the number of stack slots in testZDefOfSpillSlotWithOffsetNeedingToBeMaterializedInARegister.
* b3/air/testair.cpp:
Source/WTF:
This problem was found while investigating https://bugs.webkit.org/show_bug.cgi?id=236269.
LikelyDenseUnsignedIntegerSet has two forms: either as a HashSet (if sparse) or as a BitVector representing numbers above some value m_min (if sufficiently dense).
This is a massive memory win in most situations (>4x in practice for register allocation in JS2, >20x on some pathological webpages).
But it means that when adding a value below m_min to a set in BitVector shape, we have to rebuild the whole set, which takes time proportional to the size of the set.
So building a set by repeatedly adding decreasing values (as in https://bugs.webkit.org/show_bug.cgi?id=236269, where we add 10000, then 9999, then 9998, etc.) has quadratic cost and awful performance.
In this patch I improve this situation in two ways:
- First, I always round m_min down to a multiple of 64. This means that when adding contiguous decreasing values like the above, we only re-index once every 64 adds.
- Second, this allows the reindexing to be done by a simple memcpy instead of a costly iteration over all the set bits, since they are now always at the same offset within the words of the BitVector.
On an M1 MBP, on testair::testZDefOfSpillSlotWithOffsetNeedingToBeMaterializedInARegister, with n=5000, in release mode, measuring just the time spent building the interference graph:
Before this patch: 107 s
After this patch: 77 ms (note the different unit, it is not a typo!)
Unfortunately, it does not seem to significantly improve the time spent in LikelyDenseUnsignedIntegerSet::add in JetStream2,
probably because the pattern of always adding a value just below the current minimum is quite pathological/rare.
I still think it is worth landing, as we don't know what code out there may hit this performance problem.
* wtf/BitVector.cpp:
(WTF::BitVector::shiftRightByMultipleOf64):
(WTF::BitVector::resizeOutOfLine):
* wtf/BitVector.h:
* wtf/LikelyDenseUnsignedIntegerSet.h:
(WTF::LikelyDenseUnsignedIntegerSet::add):
(WTF::LikelyDenseUnsignedIntegerSet::validate const):
(WTF::LikelyDenseUnsignedIntegerSet::transitionToBitVector):
Modified Paths
Diff
Modified: trunk/Source/_javascript_Core/ChangeLog (291026 => 291027)
--- trunk/Source/_javascript_Core/ChangeLog 2022-03-09 02:53:09 UTC (rev 291026)
+++ trunk/Source/_javascript_Core/ChangeLog 2022-03-09 02:56:37 UTC (rev 291027)
@@ -1,3 +1,14 @@
+2022-03-08 Robin Morisset <rmoris...@apple.com>
+
+ [WTF] LikelyDenseUnsignedIntegerSet::add can cause a reindexing of the entire bit vector with every call in the worst case
+ https://bugs.webkit.org/show_bug.cgi?id=236997
+
+ Reviewed by Saam Barati.
+
+ Just make it a little bit easier to change the number of stack slots in testZDefOfSpillSlotWithOffsetNeedingToBeMaterializedInARegister.
+
+ * b3/air/testair.cpp:
+
2022-03-08 Robin Morisset <rmoris...@apple.com>
Enable tier-up in loops created by recursive tail call optimizations.
Modified: trunk/Source/_javascript_Core/b3/air/testair.cpp (291026 => 291027)
--- trunk/Source/_javascript_Core/b3/air/testair.cpp 2022-03-09 02:53:09 UTC (rev 291026)
+++ trunk/Source/_javascript_Core/b3/air/testair.cpp 2022-03-09 02:56:37 UTC (rev 291027)
@@ -2371,7 +2371,8 @@
BasicBlock* root = code.addBlock();
Vector<StackSlot*> slots;
- for (unsigned i = 0; i < 10000; ++i)
+ unsigned numberOfSlots = 10000;
+ for (unsigned i = 0; i < numberOfSlots; ++i)
slots.append(code.addStackSlot(8, StackSlotKind::Spill));
for (auto* slot : slots)
@@ -2388,7 +2389,7 @@
root->append(Ret64, nullptr, Tmp(GPRInfo::returnValueGPR));
int32_t result = compileAndRun<int>(proc, 1);
- CHECK(result == 10000);
+ CHECK(result == static_cast<int32_t>(numberOfSlots));
}
#define PREFIX "O", Options::defaultB3OptLevel(), ": "
Modified: trunk/Source/WTF/ChangeLog (291026 => 291027)
--- trunk/Source/WTF/ChangeLog 2022-03-09 02:53:09 UTC (rev 291026)
+++ trunk/Source/WTF/ChangeLog 2022-03-09 02:56:37 UTC (rev 291027)
@@ -1,3 +1,38 @@
+2022-03-08 Robin Morisset <rmoris...@apple.com>
+
+ [WTF] LikelyDenseUnsignedIntegerSet::add can cause a reindexing of the entire bit vector with every call in the worst case
+ https://bugs.webkit.org/show_bug.cgi?id=236997
+
+ Reviewed by Saam Barati.
+
+ This problem was found while investigating https://bugs.webkit.org/show_bug.cgi?id=236269.
+
+ LikelyDenseUnsignedIntegerSet has two forms: either as a HashSet (if sparse) or as a BitVector representing numbers above some value m_min (if sufficiently dense).
+ This is a massive memory win in most situations (>4x in practice for register allocation in JS2, >20x on some pathological webpages).
+ But it means that when adding a value below m_min to a set in BitVector shape, we have to rebuild the whole set, which takes time proportional to the size of the set.
+ So building a set by repeatedly adding decreasing values (as in https://bugs.webkit.org/show_bug.cgi?id=236269, where we add 10000, then 9999, then 9998, etc.) has quadratic cost and awful performance.
+
+ In this patch I improve this situation in two ways:
+ - First, I always round m_min down to a multiple of 64. This means that when adding contiguous decreasing values like the above, we only re-index once every 64 adds.
+ - Second, this allows the reindexing to be done by a simple memcpy instead of a costly iteration over all the set bits, since they are now always at the same offset within the words of the BitVector.
+
+ On an M1 MBP, on testair::testZDefOfSpillSlotWithOffsetNeedingToBeMaterializedInARegister, with n=5000, in release mode, measuring just the time spent building the interference graph:
+ Before this patch: 107 s
+ After this patch: 77 ms (note the different unit, it is not a typo!)
+
+ Unfortunately, it does not seem to significantly improve the time spent in LikelyDenseUnsignedIntegerSet::add in JetStream2,
+ probably because the pattern of always adding a value just below the current minimum is quite pathological/rare.
+ I still think it is worth landing, as we don't know what code out there may hit this performance problem.
+
+ * wtf/BitVector.cpp:
+ (WTF::BitVector::shiftRightByMultipleOf64):
+ (WTF::BitVector::resizeOutOfLine):
+ * wtf/BitVector.h:
+ * wtf/LikelyDenseUnsignedIntegerSet.h:
+ (WTF::LikelyDenseUnsignedIntegerSet::add):
+ (WTF::LikelyDenseUnsignedIntegerSet::validate const):
+ (WTF::LikelyDenseUnsignedIntegerSet::transitionToBitVector):
+
2022-03-08 Alex Christensen <achristen...@webkit.org>
Fix Windows debug build
Modified: trunk/Source/WTF/wtf/BitVector.cpp (291026 => 291027)
--- trunk/Source/WTF/wtf/BitVector.cpp 2022-03-09 02:53:09 UTC (rev 291026)
+++ trunk/Source/WTF/wtf/BitVector.cpp 2022-03-09 02:56:37 UTC (rev 291027)
@@ -1,5 +1,5 @@
/*
- * Copyright (C) 2011 Apple Inc. All rights reserved.
+ * Copyright (C) 2011-2022 Apple Inc. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
@@ -89,20 +89,33 @@
BitVectorMalloc::free(outOfLineBits);
}
-void BitVector::resizeOutOfLine(size_t numBits)
+void BitVector::shiftRightByMultipleOf64(size_t shiftInBits)
{
+ RELEASE_ASSERT(!(shiftInBits % 64));
+ static_assert(!(8 % sizeof(void*)), "BitVector::shiftRightByMultipleOf64 assumes that word size is a divisor of 64");
+ size_t shiftInWords = shiftInBits / (8 * sizeof(void*));
+ size_t numBits = size() + shiftInBits;
+ resizeOutOfLine(numBits, shiftInWords);
+}
+
+void BitVector::resizeOutOfLine(size_t numBits, size_t shiftInWords)
+{
ASSERT(numBits > maxInlineBits());
OutOfLineBits* newOutOfLineBits = OutOfLineBits::create(numBits);
size_t newNumWords = newOutOfLineBits->numWords();
if (isInline()) {
+ memset(newOutOfLineBits->bits(), 0, shiftInWords * sizeof(void*));
// Make sure that all of the bits are zero in case we do a no-op resize.
- *newOutOfLineBits->bits() = m_bitsOrPointer & ~(static_cast<uintptr_t>(1) << maxInlineBits());
- memset(newOutOfLineBits->bits() + 1, 0, (newNumWords - 1) * sizeof(void*));
+ *(newOutOfLineBits->bits() + shiftInWords) = m_bitsOrPointer & ~(static_cast<uintptr_t>(1) << maxInlineBits());
+ RELEASE_ASSERT(shiftInWords + 1 <= newNumWords);
+ memset(newOutOfLineBits->bits() + shiftInWords + 1, 0, (newNumWords - 1 - shiftInWords) * sizeof(void*));
} else {
if (numBits > size()) {
size_t oldNumWords = outOfLineBits()->numWords();
- memcpy(newOutOfLineBits->bits(), outOfLineBits()->bits(), oldNumWords * sizeof(void*));
- memset(newOutOfLineBits->bits() + oldNumWords, 0, (newNumWords - oldNumWords) * sizeof(void*));
+ memset(newOutOfLineBits->bits(), 0, shiftInWords * sizeof(void*));
+ memcpy(newOutOfLineBits->bits() + shiftInWords, outOfLineBits()->bits(), oldNumWords * sizeof(void*));
+ RELEASE_ASSERT(shiftInWords + oldNumWords <= newNumWords);
+ memset(newOutOfLineBits->bits() + shiftInWords + oldNumWords, 0, (newNumWords - oldNumWords - shiftInWords) * sizeof(void*));
} else
memcpy(newOutOfLineBits->bits(), outOfLineBits()->bits(), newOutOfLineBits->numWords() * sizeof(void*));
OutOfLineBits::destroy(outOfLineBits());
Modified: trunk/Source/WTF/wtf/BitVector.h (291026 => 291027)
--- trunk/Source/WTF/wtf/BitVector.h 2022-03-09 02:53:09 UTC (rev 291026)
+++ trunk/Source/WTF/wtf/BitVector.h 2022-03-09 02:56:37 UTC (rev 291027)
@@ -357,6 +357,8 @@
return byteCount(size());
}
+ WTF_EXPORT_PRIVATE void shiftRightByMultipleOf64(size_t);
+
private:
friend class JSC::CachedBitVector;
@@ -461,7 +463,7 @@
const OutOfLineBits* outOfLineBits() const { return bitwise_cast<const OutOfLineBits*>(m_bitsOrPointer << 1); }
OutOfLineBits* outOfLineBits() { return bitwise_cast<OutOfLineBits*>(m_bitsOrPointer << 1); }
- WTF_EXPORT_PRIVATE void resizeOutOfLine(size_t numBits);
+ WTF_EXPORT_PRIVATE void resizeOutOfLine(size_t numBits, size_t shiftInWords = 0);
WTF_EXPORT_PRIVATE void setSlow(const BitVector& other);
WTF_EXPORT_PRIVATE void mergeSlow(const BitVector& other);
Modified: trunk/Source/WTF/wtf/LikelyDenseUnsignedIntegerSet.h (291026 => 291027)
--- trunk/Source/WTF/wtf/LikelyDenseUnsignedIntegerSet.h 2022-03-09 02:53:09 UTC (rev 291026)
+++ trunk/Source/WTF/wtf/LikelyDenseUnsignedIntegerSet.h 2022-03-09 02:56:37 UTC (rev 291027)
@@ -1,5 +1,5 @@
/*
- * Copyright (C) 2021 Apple Inc. All Rights Reserved.
+ * Copyright (C) 2021, 2022 Apple Inc. All Rights Reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
@@ -39,8 +39,10 @@
// This is effectively a std::variant<HashSet, Pair<BitVector, IndexType>>
// If it is in BitVector mode, it keeps track of the minimum value in the set, and has the bitVector shifted by the same amount.
-// So for example {4000, 4002, 4003} would be represented as the bitVector 1101 with a m_min of 4000.
+// So for example {64000, 64002, 64003} would be represented as the bitVector 1101 with a m_min of 64000.
// It shifts between the two modes whenever that would at least halve its memory usage. So it will never use more than twice the optimal amount of memory, and yet should not ping-pong between the two modes too often.
+// As an optimization, instead of keeping track of the minimum value, it keeps track of the minimum value rounded down to the next multiple of 64.
+// This reduces repeated re-indexings of the bitvector when repeatedly adding a value just below the current minimum.
template<typename IndexType>
class LikelyDenseUnsignedIntegerSet {
WTF_MAKE_FAST_ALLOCATED;
@@ -111,18 +113,25 @@
if (!m_size) {
ASSERT(isBitVector());
- m_min = value;
+ m_min = value & ~63;
m_max = value;
m_size = 1;
- m_inline.bitVector.quickSet(0);
+ // not quickSet, as value - m_min might be 63, and the inline bit vector cannot store that value.
+ // So there might be some overflow here, forcing an allocation of an outline bit vector.
+ m_inline.bitVector.set(value - m_min);
return { true };
}
+ auto computeNewMin = [&]() {
+ IndexType roundedDownValue = value & ~63;
+ return std::min(m_min, roundedDownValue);
+ };
+
if (!isBitVector()) {
bool isNewEntry = m_inline.hashSet.add(value).isNewEntry;
if (!isNewEntry)
return { false };
- m_min = std::min(m_min, value);
+ m_min = computeNewMin();
m_max = std::max(m_max, value);
unsigned hashSetSize = m_inline.hashSet.capacity() * sizeof(IndexType);
unsigned wouldBeBitVectorSize = (m_max - m_min) / 8;
@@ -140,7 +149,7 @@
// We are in BitVector mode, and value is not in the bounds: we will definitely insert it as a new entry.
++m_size;
- IndexType newMin = std::min(m_min, value);
+ IndexType newMin = computeNewMin();
IndexType newMax = std::max(m_max, value);
unsigned bitVectorSize = (newMax - newMin) / 8;
unsigned wouldBeHashSetSize = estimateHashSetSize(m_size);
@@ -153,23 +162,15 @@
return { true };
}
- if (m_min <= value) {
- bool isNewEntry = !m_inline.bitVector.set(value - m_min);
- ASSERT_UNUSED(isNewEntry, isNewEntry);
- ASSERT(newMax == value);
- m_max = value;
- return { true };
+ if (value < m_min) {
+ ASSERT(newMin < m_min);
+ m_inline.bitVector.shiftRightByMultipleOf64(m_min - newMin);
+ m_min = newMin;
}
- BitVector newBitVector;
- newBitVector.resize(m_max - value + 1);
- IndexType shiftReduction = m_min - value;
- for (IndexType oldIndex : m_inline.bitVector)
- newBitVector.quickSet(oldIndex + shiftReduction);
- newBitVector.quickSet(0);
- m_inline.bitVector = WTFMove(newBitVector);
- ASSERT(newMin == value);
- m_min = value;
+ bool isNewEntry = !m_inline.bitVector.set(value - m_min);
+ ASSERT_UNUSED(isNewEntry, isNewEntry);
+ m_max = newMax;
return { true };
}
@@ -265,7 +266,9 @@
}
}
if (m_size) {
- RELEASE_ASSERT(m_min == min);
+ RELEASE_ASSERT(m_min <= min);
+ RELEASE_ASSERT(m_min + 64 > min);
+ RELEASE_ASSERT(!(m_min & 63));
RELEASE_ASSERT(m_max == max);
}
}
@@ -312,7 +315,6 @@
newBitVector.quickSet(oldValue - m_min);
++m_size;
}
- ASSERT(newBitVector.quickGet(0));
m_inline.hashSet.~Set();
new (NotNull, &m_inline.bitVector) BitVector(WTFMove(newBitVector));