Re: Use cases for invalid-Unicode atoms

Henri Sivonen Tue, 20 Mar 2018 06:51:01 -0700

On Tue, Mar 20, 2018 at 12:44 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
> On Tue, Mar 20, 2018 at 11:12 AM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
>> OK. I'll leave the UTF-16 case unchanged and will make the minimal
>> changes on the UTF-8 side to retain the existing outward behavior
>> without burning the tree. Hopefully I can make the UTF-8 case faster
>> while at it. It depended on not-so-great code.
>
> I still have doubts about retaining the exact invalid-UTF-8 behavior.
> The current behavior appears to be that if we try to atomicize an
> invalid UTF-8 string, the returned atom is a new atom representing the
> empty string--not the pre-existing atom for the empty string.
>
> Is there a reason why it's desirable behavior to potentially have
> multiple atoms representing the empty string? Is there a reason why we
> don't MOZ_CRASH on invalid UTF-8 if we are convinced enough that it's
> not supposed to happen to the point that we let go of the atomicity of
> atoms if it does happen?


Furthermore, we validate UTF-8 strings anyway as a side effect of
hashing them as if they were UTF-16, so if we don't want to MOZ_CRASH,
we could at that point swap a valid string (invalid byte sequences
replaced with U+FFFD) in the input string's place and atomicize that.
It could be a MOZ_UNLIKELY branch on the validation result that we
compute anyway and would avoid the weirdness of non-atomic atoms that
have no resemblance to the input string.
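
A minimal sketch of that idea, written in Rust for brevity rather than
in the C++ of the actual atom table. Everything here is an assumption
made for illustration: AtomTable, intern and atomize_utf8 are
hypothetical names, not Gecko APIs, and the table is a plain HashSet
instead of the real hash-as-UTF-16 scheme. It only shows the shape of
the branch: valid input is interned as-is, and invalid input takes a
cold path that swaps in a copy with U+FFFD before interning, so atoms
stay unique per string value.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical atom table for illustration; not Gecko's implementation.
struct AtomTable {
    atoms: HashSet<Rc<str>>,
}

impl AtomTable {
    fn new() -> Self {
        AtomTable { atoms: HashSet::new() }
    }

    /// Return the one shared atom for `s`, creating it on first use.
    fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.atoms.get(s) {
            return Rc::clone(existing);
        }
        let atom: Rc<str> = Rc::from(s);
        self.atoms.insert(Rc::clone(&atom));
        atom
    }

    /// Atomize bytes that are expected, but not guaranteed, to be UTF-8.
    fn atomize_utf8(&mut self, bytes: &[u8]) -> Rc<str> {
        match std::str::from_utf8(bytes) {
            // Expected path: the input validated, so intern it unchanged.
            Ok(valid) => self.intern(valid),
            // Unlikely path: replace invalid sequences with U+FFFD and
            // intern the repaired string instead of minting a fresh atom.
            Err(_) => {
                let repaired = String::from_utf8_lossy(bytes);
                self.intern(&repaired)
            }
        }
    }
}
```

With this shape, atomizing the same invalid byte sequence twice yields
the same atom, and that atom is also the one returned for a valid
string that already contains U+FFFD at the corresponding positions.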

Thoughts?

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
