I'm defining an API that takes only well-formed Str objects, meaning it would
only accept a Str whose Unicode codepoints are all in the set
{0..0xD7FF,0xE000..0x10FFFF}, so in particular there are no UTF-16 surrogate
codepoints. It would do so as a yes/no stricture, without coercing anything
outside of the set.
I am aware that behind the scenes Perl 6 uses multiple levels of abstraction
for Str objects, and in particular may often use Normalization Form Grapheme
(NFG), with synthetic codepoints outside the 0..0x10FFFF range, to be able to
represent graphemes in constant space.
I have a few questions:
1. Do I even have to test the Str at all? Does Perl 6 guarantee that all Str
are well formed, such that, for example, decoding UTF-16 that contains invalid
surrogate codepoints (lone ones or ones not properly paired up) would fail
early? Or is it possible that a Str could be created without fuss that contains
the invalid surrogates? I suspect Perl 6's inherent laziness would make passing
through invalid codepoints more likely, but perhaps that isn't the case.
2. Does Perl 6 ever have Str that are not internally in some normal form? That
is, if a file contains, say, a mixture of NFC and NFD, will the actual
codepoints be preserved as read until some operation requires them to be in a
normal form? I'm thinking this may be a good case for laziness, e.g. you don't
need normal forms to just move data around, but they can help if you want to
count graphemes, so it only normalizes when such an operation happens.
3. If a Str can contain invalid surrogates or be malformed in some other way,
what is the best / most performant way to test that a Str is valid? Context is
akin to a "Str where ..." and what we put in the "...".
4. How can I get the actual codepoints from a Str without normalizing them
first? I realize that for typical use cases, explicitly using the NFC/NFD etc.
methods, or "ords" which uses NFC, is the most correct, but if, say, I just want
what we already have, how would I do that? I realize the result may not be
particularly useful in the face of NFG.
For a wider context, I know that on other platforms like .NET or Java it is
possible for strings to contain invalid surrogates, and I'm trying to figure
out if Perl 6 can have the same problem or not.
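To make the Java case concrete, here is a minimal sketch of the problem and of
the kind of yes/no test I have in mind. The class and method names
(SurrogateCheck, isWellFormed) are my own and purely illustrative; the only
library calls used are String.codePointAt and Character.charCount. It is not
meant as a statement of how Perl 6 behaves.

```java
final class SurrogateCheck {
    // Hypothetical helper: true if every code point of s lies in
    // 0..0xD7FF or 0xE000..0x10FFFF, i.e. s has no unpaired surrogate.
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); ) {
            // codePointAt combines a valid high+low pair into one code point;
            // an unpaired surrogate comes back as its own value.
            int cp = s.codePointAt(i);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String lone = "a\uD800b";          // lone high surrogate, accepted by Java
        String paired = "a\uD83D\uDE00b";  // properly paired surrogates (U+1F600)
        System.out.println(isWellFormed(lone));    // prints false
        System.out.println(isWellFormed(paired));  // prints true
    }
}
```

Java constructs the first string without complaint, which is exactly the
situation I want to know whether a Perl 6 Str can ever be in.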
Thank you.
-- Darren Duncan