Short answer: to my knowledge, if you can make a string contain invalid 
codepoints, it is a bug and should be reported so that it can be fixed.
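
If a defensive check is still wanted, the predicate itself is small: accept 
only codepoints in {0..0xD7FF,0xE000..0x10FFFF}. Below is a minimal sketch 
of that test, written in Python rather than Perl 6 only because Python 
strings can hold lone surrogates, so the failing case can actually be 
constructed. The name is_well_formed is made up for illustration and is not 
part of any library; the same condition is what would go in the "..." of a 
"Str where ...".

```python
def is_well_formed(s: str) -> bool:
    """Return True if every code point in s is a Unicode scalar value."""
    # Python code points never exceed 0x10FFFF, so the only values to
    # reject are the UTF-16 surrogates 0xD800..0xDFFF.
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


# A string of ordinary scalar values, including one above the BMP.
print(is_well_formed("caf\u00e9 \U0001F600"))

# A string containing a lone high surrogate.
print(is_well_formed("bad \ud800 string"))
```

This is a yes/no test only: it coerces nothing and makes a single pass over 
the string.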

> On 15 Sep 2019, at 23:08, Darren Duncan <dar...@darrenduncan.net> wrote:
> 
> I'm defining an API that takes only well formed Str objects, meaning it would 
> only accept Str whose Unicode codepoints are all in the set 
> {0..0xD7FF,0xE000..0x10FFFF} and in particular there are no UTF-16 surrogate 
> characters, and it would do so as a yes/no stricture without coercing 
> anything outside of the set.
> 
> I am aware of how behind the scenes Perl 6 uses multiple levels of 
> abstraction for Str objects, and in particular may often use Normal Form G to 
> utilize codepoints above 0x10FFFF to be able to represent graphemes in 
> constant space.
> 
> I have a few questions:
> 
> 1. Do I even have to test the Str at all?  Does Perl 6 guarantee that all Str 
> are well formed, such that for example if one tried to decode UTF-16 that 
> contained invalid surrogate codepoints (single ones or ones not properly 
> paired up) that this would fail early, or is it possible that a Str could be 
> created without fuss that contains the invalid surrogates?  I suspect Perl 6's 
> inherent laziness would make passing through invalid codepoints more likely, 
> but perhaps that isn't the case.
> 
> 2. Does Perl 6 ever have Str that are not internally in some normal form?  
> That is, if a file contains say a mixture of NFC and NFD, the actual 
> codepoints will be preserved at the start until some operation requires them 
> to be in a normal form?  I'm thinking this may be a good case for laziness, 
> eg you don't need normal forms to just move data around, but it can help if 
> you want to count graphemes, so it only normalizes when such an operation 
> happens.
> 
> 3. If a Str can contain invalid surrogates or be wrong in some other way, 
> what is the best / most performant way to test that a Str is only valid?  
> Context is akin to a "Str where ..." and what we put in the "...".
> 
> 4. How can I get the actual codepoints from a Str without normalizing them 
> first?  I realize for typical use cases, explicitly using the NFC/NFD etc 
> methods, or "ords" which uses NFC, is the most correct, but if say I just 
> want what we already have, how would I do that?  I realize the result may not 
> be particularly useful in the face of NFG.
> 
> For a wider context, I know that in other programming languages like .NET or 
> Java it is possible for their strings to have invalid surrogates, and I'm 
> trying to figure out if Perl 6 can have the same problem or not.
> 
> Thank you.
> 
> -- Darren Duncan
