Re: [dev-servo] WTF-8 encoding for DOM strings and HTML parsing

Henri Sivonen Mon, 24 Nov 2014 00:37:09 -0800

On Thu, Oct 9, 2014 at 2:06 PM, Nicholas Nethercote
<n.netherc...@gmail.com> wrote:
> On Thu, Oct 9, 2014 at 9:21 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
>> On Wed, Oct 8, 2014 at 4:13 PM, Jan de Mooij <jandemo...@gmail.com> wrote:
>>
>> Has SpiderMonkey ever been instrumented to find out if most strings
>> are even just ASCII?
>
> There are some measurements in
> https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/.
>
> But even better, you can visit about:memory and see for yourself. Look
> for entries like this:
>
> │   ├──26.43 MB (08.36%) -- strings
> │   │  ├──13.98 MB (04.43%) -- malloc-heap
> │   │  │  ├──10.84 MB (03.43%) ── latin1
> │   │  │  └───3.14 MB (00.99%) ── two-byte
> │   │  └──12.45 MB (03.94%) -- gc-heap
> │   │     ├───9.05 MB (02.86%) ── latin1
> │   │     └───3.40 MB (01.08%) ── two-byte

Both the blog post and about:memory report latin1 vs. other but don't
report ASCII vs. other.
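
To make the distinction concrete, here is a minimal sketch of the kind of tally that would answer the question. It is not SpiderMonkey instrumentation; `classify` and `tally` are hypothetical names, and the input is assumed to be a plain list of strings sampled from somewhere.

```python
def classify(s: str) -> str:
    """Return the narrowest storage class that can hold every character of s."""
    if all(ord(ch) < 0x80 for ch in s):
        return "ascii"      # every code unit fits in 7 bits
    if all(ord(ch) <= 0xFF for ch in s):
        return "latin1"     # fits in 8 bits, but is not pure ASCII
    return "two-byte"       # needs 16-bit code units


def tally(strings):
    """Count how many strings fall into each storage class (hypothetical helper)."""
    counts = {"ascii": 0, "latin1": 0, "two-byte": 0}
    for s in strings:
        counts[classify(s)] += 1
    return counts
```

The point of splitting "latin1" from "ascii" is that only the ASCII case has identical byte offsets in UTF-8 and UTF-16; a Latin-1 string containing U+00E9 already takes two bytes per such character in UTF-8.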

On Fri, Oct 10, 2014 at 9:18 PM, Jan de Mooij <jandemo...@gmail.com> wrote:
> On Thu, Oct 9, 2014 at 12:21 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
>>
>> It would be even more tragic to miss the opportunity to use 8-bit code
>> units for strings in Servo because JS crypto benchmarks use strings.
>> What chances are there to retire the use of strings-for-crypto in
>> benchmarking? Such a benchmark doesn't represent a reasonable real
>> application. A reasonable real application would use the Web Crypto
>> API to delegate crypto operations to outside the JS engine or use
>> ArrayBuffers to perform byte-oriented operations inside the JS engine.
>
> Absolutely, but there are outdated/crappy benchmarks like Sunspider that we
> can't regress too much to avoid bad press.

When Sunspider measuring silly things hurts all browser vendors, is
there any chance of each browser vendor in its own corner deciding to
stop advertising Sunspider scores (e.g. hiding Sunspider from public
view at http://arewefastyet.com/ ) and then evangelizing the press to
stop using Sunspider scores?

(Or, alternatively, doing something as ludicrous as recognizing
Sunspider and returning precomputed results, so that the press no
longer wants to use Sunspider for anything?)

>> Besides charAt/charCodeAt, what operations do you expect to be
>> adversely affected by WTF-8 memory layout?
>
>
> Operations like indexOf, replace, regular expressions matter a lot for
> certain benchmarks.

As noted, UTF-16 vs. UTF-8 does not affect the regularity of regular
expressions, so there shouldn't be any reason why, with effort,
regular expressions couldn't be made fast for UTF-8 when it is time to
put serious productization effort into Servo. (As long as Servo is
research, it shouldn't matter that much--since, as noted, "can UTF-8
regexps be fast?" is not an open research question. The answer is
"yes".)

Replace only exposes UTF-16 indices when the replacement argument is a
function, right? Is that variant benchmark-sensitive? I wonder if the
compiler could check if the offset argument is ignored by the
function.

Speaking of the compiler looking at the calling context to decide if
UTF-16 semantics are really needed, foo.indexOf(bar) != -1 is a really
common pattern that could be optimized not to care about UTF-16
indices. Furthermore, it seems that Boyer–Moore–Horspool, where
knowing the UTF-16 skip lengths of parts of the haystack would be a
problem, is currently only used in some cases:
http://dxr.mozilla.org/mozilla-central/source/js/src/jsstr.cpp#1297 .
If the benchmarks mostly exercise string lengths that use linear
search, returning UTF-16 indices with UTF-8 underlying data is
feasible.
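
As a rough illustration of the linear-search case, the sketch below scans UTF-8 bytes left to right and keeps a running count of UTF-16 code units, so the UTF-16 index is already known when a match is found. `index_of_utf16` is a hypothetical name, not anything in SpiderMonkey or Servo. It assumes well-formed UTF-8 input and does not handle the WTF-8 corner case of a lone-surrogate needle that should match half of a surrogate pair.

```python
def index_of_utf16(haystack: bytes, needle: bytes) -> int:
    """Linear search over UTF-8 bytes that returns a UTF-16 code unit index.

    Returns -1 when the needle is absent, mirroring indexOf.
    """
    if not needle:
        return 0
    utf16_index = 0  # UTF-16 code units covered by haystack[:i]
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        b = haystack[i]
        is_lead = (b & 0xC0) != 0x80  # not a continuation byte
        if is_lead and haystack[i:i + n] == needle:
            return utf16_index
        if is_lead:
            utf16_index += 1          # each code point is at least one unit
        if b >= 0xF0:
            utf16_index += 1          # 4-byte sequence: surrogate pair in UTF-16
    return -1
```

Because the scan never skips ahead, the count costs one comparison per byte. A skipping search such as Boyer–Moore–Horspool would have to recount the skipped bytes, which is the problem described above. A caller that only tests the result against -1 could drop the counting entirely.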

> An ASCII-only bit would probably help most of those
> benchmarks though...

Right.
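
A minimal sketch of why such a bit helps, under the assumption that the flag is computed once when the string is created. `Utf8String` and `char_code_at` are hypothetical names, and the slow path here is deliberately naive.

```python
class Utf8String:
    """UTF-8 backed string with an ASCII-only flag (illustrative only)."""

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        # Computed once up front; an engine would store this as a flag bit.
        self.ascii_only = all(b < 0x80 for b in self.data)

    def char_code_at(self, index: int) -> int:
        """Return the UTF-16 code unit at the given UTF-16 index."""
        if self.ascii_only:
            # Fast path: byte offset == UTF-16 index, so this is O(1).
            return self.data[index]
        # Slow path: O(n) transcoding to UTF-16; bounds checks omitted.
        units = self.data.decode("utf-8").encode("utf-16-le")
        return int.from_bytes(units[2 * index:2 * index + 2], "little")
```

For ASCII-only strings, charAt/charCodeAt and the indices returned by indexOf behave exactly as they would with 16-bit storage, so only strings containing non-ASCII characters pay for index translation.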

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
