Re: Extending the length of an XPCOM string when writing to it via a raw pointer

Henri Sivonen Thu, 30 Aug 2018 09:44:22 -0700

On Thu, Aug 30, 2018 at 6:00 PM, smaug <sm...@welho.com> wrote:
> On 08/30/2018 11:21 AM, Henri Sivonen wrote:
>>
>> We have the following that a pattern in our code base:
>>
>>   1) SetCapacity(newCapacity) is called on an XPCOM string.
>>   2) A pointer obtained from BeginWriting() is used for writing more
>> than Length() but no more than newCapacity code units to the XPCOM
>> string.
>>   3) SetLength(actuallyWritten)  is called on the XPCOM string such
>> that actuallyWritten > Length() but actuallyWritten <= newCapacity.
>>
>> This is an encapsulation violation, because the string implementation
>> doesn't know that the content past Length() is there.
>
> How so? The whole point of capacity is that the string has that much
> capacity.


It has that much capacity for the use of the string implementation so
that you can do a sequence of Append()s without reallocating multiple
times. It doesn't mean that the caller is eligible to write into the
internal structure of the string in an undocumented way.

>  The caller
>>
>> assumes that step #3 will not reallocate and will only write a
>> zero-terminator at actuallyWritten and set mLength to actuallyWritten.
>> (The pattern is common enough that clearly people have believed it to
>> be a valid pattern. However, I haven't seen any in-tree or on-wiki
>> string documentation endorsing the pattern.)
>>
>> It should be non-controversial that this is an encapsulation
>> violation,
>
> Well, I'm not seeing any encapsulation violation ;)
>
>
>> but does the encapsulation violation matter? It matters if
>> we want SetLength() to be able to conserve memory by allocating a
>> smaller buffer when actuallyWritten code units would fit in a smaller
>> mozjemalloc bucket.
>
>
> Please be very very careful when doing allocations and deallocations. They
> are very slow, showing
> up all the time in performance profiles.

There is a threshold so that we don't reallocate from small to even
smaller. There's a good chance that the threshold should be higher
than it is now.

>  In order for the above pattern to work if
>>
>> SetLength() can reallocate in such case, SetLength() has to memcpy the
>> whole buffer in case someone has written content that the string
>> implementation is unaware of instead of just memcpying the part of the
>> buffer that the string implementation knows to be in use. Pessimizing
>> the number of code units to memcpy is bad for performance.
>>
>> It's unclear if trying to use a smaller mozjemalloc bucket it is a
>> worthwhile thing. It obviously is for large long-lived strings and it
>> obviously isn't for short-lived strings even if they are large.
>> SetLength() doesn't know what the future holds for the string. :-( But
>> in any case, it's bad if we can't make changes that are sound from the
>> perspective of the string encapsulation, because we have other code
>> violating the encapsulation.
>>
>> After the soft freeze, I'd like to change things so that we memcpy
>> only the part of the buffer that the string implementation knows is in
>> use. To that end, we should stop using the above pattern that is an
>> encapsulation violation.
>>
> What is then the point of SetCapacity anymore?

To avoid multiple allocations during a sequence of Append()s. (This is
documented on the header.)

>> For m-c, I've filed
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1472113 and worked on
>> fixing the bugs that block it. For c-c, I've filed
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1486706 but don't intend
>> to do the work of investigating or fixing the string usage in c-c.
>>
>> As for fixing the above pattern, there are two alternatives. The first one
>> is:
>>
>>   1) SetLength(newCapacity)
>>   2) BeginWriting()
>>   3) Truncate(actuallyWritten) (or SetLength(actuallyWritten), Truncate
>> simply tells to the reader that the string isn't being made longer)
>>
>> With this pattern, writing happens to the part of the buffer that the
>> string implementation believes to be in use. This has the downside
>> that the first SetLength() call (like, counter-intuitively,
>> SetCapacity() currently!) writes the zero terminator, which from the
>> point of view of CPU caches is an out-of-the-blue write that's not
>> part of a well-behaved forward-only linear write pattern and not
>> necessarily near recently-accessed locations.
>>
>> The second alternative is BulkWrite() in C++ and bulk_write() in Rust.
>
> The API doesn't seem to be too user friendly.

Zero-termination is sad. An API that doesn't zero-terminate eagerly
but maintains some kind of encapsulation necessarily leads to a
complex API. If you don't care about how capacity was rounded up and
don't care about avoiding cache-unfriendly eager zero termination, you
can use the other pattern suggested above, which is almost like the
old pattern.

-- 
Henri Sivonen
hsivo...@mozilla.com
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform

Re: Extending the length of an XPCOM string when writing to it via a raw pointer

Reply via email to