Short answer: to my knowledge, if you can make a string contain invalid codepoints, it is a bug and should be reported so that it can be fixed.

> On 15 Sep 2019, at 23:08, Darren Duncan <dar...@darrenduncan.net> wrote:
>
> I'm defining an API that takes only well formed Str objects, meaning it would
> only accept Str whose Unicode codepoints are all in the set
> {0..0xD7FF,0xE000..0x10FFFF} and in particular there are no UTF-16 surrogate
> characters, and it would do so as a yes/no stricture without coercing
> anything outside of the set.
>
> I am aware of how behind the scenes Perl 6 uses multiple levels of
> abstraction for Str objects, and in particular may often use Normal Form G to
> utilize codepoints above 0x10FFFF to be able to represent graphemes in
> constant space.
>
> I have a few questions:
>
> 1. Do I even have to test the Str at all? Does Perl 6 guarantee that all Str
> are well formed, such that for example if one tried to decode UTF-16 that
> contained invalid surrogate codepoints (single ones or ones not properly
> paired up) that this would fail early, or is it possible that a Str could be
> created without fuss that contains the invalid surrogates? I suspect Perl 6'
> inherent laziness would make passing through invalid codepoints more likely,
> but perhaps that isn't the case.
>
> 2. Does Perl 6 ever have Str that are not internally in some normal form?
> That is, if a file contains say a mixture of NFC and NFD, the actual
> codepoints will be preserved at the start until some operation requires them
> to be in a normal form? I'm thinking this may be a good case for laziness,
> eg you don't need normal forms to just move data around, but it can help if
> you want to count graphemes, so it only normalizes when such an operation
> happens.
>
> 3. If a Str can contain invalid surrogates or be wrong in some other way,
> what is the best / most performant way to test that a Str is only valid?
> Context is akin to a "Str where ..." and what we put in the "...".
>
> 4. How can I get the actual codepoints from a Str without normalizing them
> first? I realize for typical use cases, explicitly using the NFC/NFD etc
> methods, or "ords" which uses NFC, is the most correct, but if say I just
> want what we already have, how would I do that? I realize the result may not
> be particularly useful in the face of NFG.
>
> For a wider context, I know that in other programming languages like .NET or
> Java it is possible for their strings to have invalid surrogates, and I'm
> trying to figure out if Perl 6 can have the same problem or not.
>
> Thank you.
>
> -- Darren Duncan
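
For reference, the membership test for the set quoted above, {0..0xD7FF,0xE000..0x10FFFF}, can be sketched in a few lines. This is only an illustration in Python, chosen because a Python str, like the .NET and Java strings mentioned above, can hold a lone surrogate. It is not Perl 6 code and says nothing about what Perl 6 does internally. The name is_well_formed is hypothetical.

```python
def is_well_formed(s: str) -> bool:
    """Return True if every code point of s is a Unicode scalar value.

    Hypothetical helper: accepts exactly the code points in
    0..0xD7FF and 0xE000..0x10FFFF, i.e. everything except the
    UTF-16 surrogate range 0xD800..0xDFFF.
    """
    # ord() yields the code point of each character; a Python str can
    # never hold a value above 0x10FFFF, but the upper bound is kept so
    # that the test mirrors the set exactly as written in the question.
    return all(
        cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF
        for cp in map(ord, s)
    )
```

The check is a plain yes/no test and coerces nothing, which matches the stricture described in the question. If a test turns out to be needed at all, the same predicate would be what goes in the "..." of a "Str where ...".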