On Thu, Aug 30, 2018 at 6:00 PM, smaug <sm...@welho.com> wrote: > On 08/30/2018 11:21 AM, Henri Sivonen wrote: >> >> We have the following that a pattern in our code base: >> >> 1) SetCapacity(newCapacity) is called on an XPCOM string. >> 2) A pointer obtained from BeginWriting() is used for writing more >> than Length() but no more than newCapacity code units to the XPCOM >> string. >> 3) SetLength(actuallyWritten) is called on the XPCOM string such >> that actuallyWritten > Length() but actuallyWritten <= newCapacity. >> >> This is an encapsulation violation, because the string implementation >> doesn't know that the content past Length() is there. > > How so? The whole point of capacity is that the string has that much > capacity.
It has that much capacity for the use of the string implementation so that you can do a sequence of Append()s without reallocating multiple times. It doesn't mean that the caller is eligible to write into the internal structure of the string in an undocumented way. > The caller >> >> assumes that step #3 will not reallocate and will only write a >> zero-terminator at actuallyWritten and set mLength to actuallyWritten. >> (The pattern is common enough that clearly people have believed it to >> be a valid pattern. However, I haven't seen any in-tree or on-wiki >> string documentation endorsing the pattern.) >> >> It should be non-controversial that this is an encapsulation >> violation, > > Well, I'm not seeing any encapsulation violation ;) > > >> but does the encapsulation violation matter? It matters if >> we want SetLength() to be able to conserve memory by allocating a >> smaller buffer when actuallyWritten code units would fit in a smaller >> mozjemalloc bucket. > > > Please be very very careful when doing allocations and deallocations. They > are very slow, showing > up all the time in performance profiles. There is a threshold so that we don't reallocate from small to even smaller. There's a good chance that the threshold should be higher than it is now. > In order for the above pattern to work if >> >> SetLength() can reallocate in such case, SetLength() has to memcpy the >> whole buffer in case someone has written content that the string >> implementation is unaware of instead of just memcpying the part of the >> buffer that the string implementation knows to be in use. Pessimizing >> the number of code units to memcpy is bad for performance. >> >> It's unclear if trying to use a smaller mozjemalloc bucket it is a >> worthwhile thing. It obviously is for large long-lived strings and it >> obviously isn't for short-lived strings even if they are large. >> SetLength() doesn't know what the future holds for the string. :-( But >> in any case, it's bad if we can't make changes that are sound from the >> perspective of the string encapsulation, because we have other code >> violating the encapsulation. >> >> After the soft freeze, I'd like to change things so that we memcpy >> only the part of the buffer that the string implementation knows is in >> use. To that end, we should stop using the above pattern that is an >> encapsulation violation. >> > What is then the point of SetCapacity anymore? To avoid multiple allocations during a sequence of Append()s. (This is documented on the header.) >> For m-c, I've filed >> https://bugzilla.mozilla.org/show_bug.cgi?id=1472113 and worked on >> fixing the bugs that block it. For c-c, I've filed >> https://bugzilla.mozilla.org/show_bug.cgi?id=1486706 but don't intend >> to do the work of investigating or fixing the string usage in c-c. >> >> As for fixing the above pattern, there are two alternatives. The first one >> is: >> >> 1) SetLength(newCapacity) >> 2) BeginWriting() >> 3) Truncate(actuallyWritten) (or SetLength(actuallyWritten), Truncate >> simply tells to the reader that the string isn't being made longer) >> >> With this pattern, writing happens to the part of the buffer that the >> string implementation believes to be in use. This has the downside >> that the first SetLength() call (like, counter-intuitively, >> SetCapacity() currently!) writes the zero terminator, which from the >> point of view of CPU caches is an out-of-the-blue write that's not >> part of a well-behaved forward-only linear write pattern and not >> necessarily near recently-accessed locations. >> >> The second alternative is BulkWrite() in C++ and bulk_write() in Rust. > > The API doesn't seem to be too user friendly. Zero-termination is sad. An API that doesn't zero-terminate eagerly but maintains some kind of encapsulation necessarily leads to a complex API. If you don't care about how capacity was rounded up and don't care about avoiding cache-unfriendly eager zero termination, you can use the other pattern suggested above, which is almost like the old pattern. -- Henri Sivonen hsivo...@mozilla.com _______________________________________________ dev-platform mailing list dev-platform@lists.mozilla.org https://lists.mozilla.org/listinfo/dev-platform