Re: Comparing inputs with source strings
Karl Williamson writes:

> On 05/09/2016 08:53 AM, Daniel Dehennin wrote:
>> Hello,
>>
>> I tried to make my Perl5 code Unicode compliant after reading a post on
>> stackoverflow[1].
>>
>> As suggested in the post:
>>
>> “always run incoming stuff through NFD and outbound stuff from NFC.”
>>
>> I had a hard time finding out why my Test::More test was failing while
>> displaying exactly the same strings for “got” and “expected”.
>>
>> I finally checked how UTF-8 sources are handled and found that they are in
>> NFC form. I ran the following script:

[...]

> I'm afraid that when it comes to normalization in Perl5, you have to
> do it yourself. I hear that Perl6 is much friendlier in this regard,
> but I have no personal experience with it. Your $unistring is in
> whatever normalization you made it when you typed it into your editor,
> or whatever your editor did with it as you were typing. You could
> have typed it in NFD, but the most natural way to enter things on
> your keyboard will probably, underneath it all, be NFC.

That's what I finally found out in another post. Normally all my inputs
are NFD, but my tests matched against static strings, so I declared those
with NFD to make it explicit. I also added a note in my POD to signal that
the sub returns NFD strings.

> Normalization is tricky, and the Unicode Consortium has had to modify
> things years after they were first specified, because no one could
> reasonably implement what was expected. I may tackle getting
> normalization to be more developer friendly in future Perl5 versions,
> but not in the next couple of years.

Thanks. As soon as my little work project is working well, I'll try to
redo it in Perl6.

Regards.
-- 
Daniel Dehennin
Get my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
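To illustrate the "do it yourself" approach for static test strings, here is a minimal sketch. The helper name `same_text` is hypothetical (it does not come from the thread), and the strings are written with escapes so that the editor's own normalization cannot interfere:

#+begin_src perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true when both strings are equal once each
# side has been brought to NFD.
sub same_text {
    my ($got, $expected) = @_;
    return NFD($got) eq NFD($expected);
}

my $input  = "chai\x{302}ne";   # "i" followed by COMBINING CIRCUMFLEX ACCENT
my $static = "cha\x{EE}ne";     # precomposed LATIN SMALL LETTER I WITH CIRCUMFLEX

# A plain "eq" compares code points, so the two spellings differ.
print "eq: ", ($input eq $static ? "same" : "different"), "\n";

# After normalizing both sides the strings compare equal.
print "same_text: ", (same_text($input, $static) ? "same" : "different"), "\n";
#+end_src

Normalizing both sides in one place means the test no longer depends on which form the source file happened to be saved in.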
Re: Comparing inputs with source strings
Daniel Dehennin writes:

[...]

> I can't imagine declaring all my static string variables with:
>
>     my $unistring = NFD('C’est une chaîne unicode');

Hey hey, it's more complicated than that: it depends on how the source
was encoded. The following matches none of the forms:

    'C’est une chaîne unicode avec É'

since the “É” here is “\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}”
(decomposed), while the “î” is precomposed. The string is therefore neither
fully composed nor fully decomposed.

So, it looks like no normalisation is done on sources.

Regards.
-- 
Daniel Dehennin
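The mixed case can be reproduced without relying on how an editor saves the file. This is a sketch only: `matching_forms` is a hypothetical helper name, and the string is a shortened stand-in for the one in the message, built with escapes so that the “î” is precomposed and the “É” is decomposed:

#+begin_src perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize);

# Hypothetical helper: list the normalization forms that leave
# the string unchanged.
sub matching_forms {
    my ($string) = @_;
    return grep { $string eq normalize($_, $string) } qw(NFD NFC NFKD NFKC);
}

# U+00EE is the precomposed "î"; "E" . U+0301 is a decomposed "É".
my $mixed = "cha\x{EE}ne avec E\x{301}";

my @forms = matching_forms($mixed);
print @forms ? "matches: @forms\n" : "matches none of the forms\n";
#+end_src

NFD would decompose the “î” and NFC would compose the “É”, so every form changes the string and the list comes back empty.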
Re: Comparing inputs with source strings
Daniel Dehennin writes:

> Hello,
>
> I tried to make my Perl5 code Unicode compliant after reading a post on
> stackoverflow[1].
>
> As suggested in the post:
>
> “always run incoming stuff through NFD and outbound stuff from NFC.”

The same advice appears in perlunicode[1]:

    “The usual advice is to convert your inputs to NFD before processing further”

I can't imagine declaring all my static string variables with:

    my $unistring = NFD('C’est une chaîne unicode');

Regards.

Footnotes:
[1] http://perldoc.perl.org/perlunicode.html

-- 
Daniel Dehennin
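One way to read the quoted advice is to normalize only at the program's boundaries. The sketch below assumes UTF-8 byte input and output; `text_in` and `text_out` are hypothetical helper names, not part of any module:

#+begin_src perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode);
use Unicode::Normalize qw(NFD NFC);

# Hypothetical boundary helpers: decode and decompose on the way in,
# compose and encode on the way out.
sub text_in  { my ($bytes) = @_; return NFD(decode('UTF-8', $bytes)); }
sub text_out { my ($text)  = @_; return encode('UTF-8', NFC($text)); }

# UTF-8 bytes for "chaîne" written with a precomposed "î" (0xC3 0xAE).
my $bytes = "cha\xC3\xAEne";

my $text = text_in($bytes);
print "code points after NFD: ", length($text), "\n";
print "round trip: ", (text_out($text) eq $bytes ? "unchanged" : "changed"), "\n";
#+end_src

This covers data that enters through such helpers. String literals in the source do not pass through them, which is the problem raised above.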
Comparing inputs with source strings
Hello,

I tried to make my Perl5 code Unicode compliant after reading a post on
stackoverflow[1].

As suggested in the post:

    “always run incoming stuff through NFD and outbound stuff from NFC.”

I had a hard time finding out why my Test::More test was failing while
displaying exactly the same strings for “got” and “expected”.

I finally checked how UTF-8 sources are handled and found that they are in
NFC form. I ran the following script:

#+begin_src perl
#!/usr/bin/env perl

use utf8;       # the string literal below is UTF-8 encoded source
use strict;
use warnings;

use Unicode::Normalize qw(normalize);

my $unistring = 'C’est une chaîne unicode';

# Report every normalization form that leaves the literal unchanged.
for my $form (qw(NFD NFC NFKD NFKC)) {
    if ($unistring eq normalize($form, $unistring)) {
        print "UTF-8 source is in form '$form'\n";
    }
}
#+end_src

and got:

#+begin_src
UTF-8 source is in form 'NFC'
UTF-8 source is in form 'NFKC'
#+end_src

So Test::More::is_deeply was trying to compare an input in NFD with an
expected string in NFC.

My own code can use Unicode::Collate, but for all the code I did not write
I wonder if there is a way to handle this cleanly.

Or maybe I'm doing something wrong?

Regards.

Footnotes:
[1] https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

-- 
Daniel Dehennin
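For reference, the Unicode::Collate comparison mentioned above could look like the following sketch. It assumes a collator built with default options; the wrapper name `collate_equal` is hypothetical:

#+begin_src perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Collate;

# A collator with default options; it normalizes its arguments
# internally before comparing them.
my $collator = Unicode::Collate->new;

# Hypothetical wrapper around the collator's eq method.
sub collate_equal {
    my ($left, $right) = @_;
    return $collator->eq($left, $right) ? 1 : 0;
}

my $nfc = "cha\x{EE}ne";      # precomposed "î"
my $nfd = "chai\x{302}ne";    # "i" followed by COMBINING CIRCUMFLEX ACCENT

print "eq operator: ", ($nfc eq $nfd ? "same" : "different"), "\n";
print "collator:    ", (collate_equal($nfc, $nfd) ? "same" : "different"), "\n";
#+end_src

Note that the collator compares collation keys rather than code points, so it may also treat some strings as equal that are not canonically equivalent. For strict comparisons, normalizing both sides with Unicode::Normalize is the narrower tool.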