Re: Comparing inputs with source strings

2016-05-11 Thread Daniel Dehennin
Karl Williamson  writes:

> On 05/09/2016 08:53 AM, Daniel Dehennin wrote:
>> Hello,
>>
>> I tried to make my Perl5 code unicode compliant after reading a post on
>> stackoverflow[1].
>>
>> As suggested in the post:
>>
>>  “always run incoming stuff through NFD and outbound stuff from NFC.”
>>
>> I had a hard time finding out why my Test::More test was failing while
>> displaying exactly the same strings for “got” and “expected”.
>>
>> I finally checked how UTF-8 sources are handled and found that they are
>> in NFC form. I ran the following script:

[...]

> I'm afraid that when it comes to normalization in Perl5, you have to
> do it yourself.  I hear that Perl6 is much friendlier in this regard,
> but I have no personal experience with it.  Your $unistring is in
> whatever normalization you made it when you typed it into your editor,
> or whatever your editor did with it as you were typing.  You could
> have typed it in NFD, but probably the most natural way to enter
> things on your keyboard will, underlying it all, be NFC.

That's what I finally found out in another post. Normally all my inputs
are NFD, but my tests matched them against static strings, so I now
declare those strings with NFD() to make the form explicit.

I added a note in my POD to signal that the sub returns NFD strings.
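
A minimal sketch of what such a test could look like. The sub name
clean_input is a hypothetical stand-in for a sub documented as returning
NFD, and the accented letter is written as an escape so the example does
not depend on how an editor saved the file:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Test::More;
use Unicode::Normalize qw(NFD);

# Hypothetical stand-in for a sub whose POD says it returns NFD strings.
sub clean_input {
    my ($text) = @_;
    return NFD($text);
}

# Wrap the expected literal in NFD() so its form is explicit,
# whatever form the literal itself was typed in.
my $expected = NFD("cha\x{EE}ne");    # U+00EE is a precomposed i-circumflex

# The input arrives precomposed; the sub hands back NFD.
my $got = clean_input("cha\x{EE}ne");

is($got, $expected, 'sub returns NFD and the expected value is declared as NFD');
isnt($got, "cha\x{EE}ne", 'the raw precomposed literal does not match');

done_testing();
```

Both assertions are expected to pass: the second one shows that the
unwrapped literal really is a different string from the NFD result.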

> Normalization is tricky, and the Unicode Consortium has had to modify
> things years after they were first specified, because no one could
> reasonably implement what was expected.  I may tackle getting
> normalization to be more developer friendly in future Perl5 versions,
> but not in the next couple of years.

Thanks, as soon as my little work project is working well I'll try to
redo it in Perl6.

Regards.

-- 
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF




Re: Comparing inputs with source strings

2016-05-10 Thread Daniel Dehennin
Daniel Dehennin  writes:


[...]

> I can't imagine declaring all my static string variables with:
>
> my $unistring = NFD('C’est une chaîne unicode');

Hey hey, it's more complicated than that: it depends on how the source
was encoded. The following string matches none of the forms:

'C’est une chaîne unicode avec É'

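That is because the capital E with acute there was entered as
“\N{LATIN CAPITAL LETTER E}\N{COMBINING ACUTE ACCENT}” while the “î” is
still precomposed, so the string is neither fully composed nor fully
decomposed.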

So, it looks like no normalisation is done on sources.
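
The mixed case can be reproduced without relying on an editor by writing
the characters as escapes. This is only a sketch under that assumption;
the string below is a shortened illustration, not the original source
line:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Unicode::Normalize;    # exports NFD, NFC, NFKD and NFKC by default

# Map each form name to its normalization function.
my %forms = (NFD => \&NFD, NFC => \&NFC, NFKD => \&NFKD, NFKC => \&NFKC);

# A precomposed i-circumflex (U+00EE) next to a decomposed
# capital E followed by a combining acute accent (U+0301).
my $mixed = "cha\x{EE}ne avec E\x{301}";

# Keep only the forms that leave the string unchanged.
my @matching = grep { $mixed eq $forms{$_}->($mixed) } sort keys %forms;

print @matching
    ? "String is in form(s): @matching\n"
    : "String matches none of the four forms\n";
```

Every form changes the string, either by decomposing U+00EE or by
composing E with U+0301, so the second message is printed.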

Regards.


Re: Comparing inputs with source strings

2016-05-10 Thread Daniel Dehennin
Daniel Dehennin  writes:

> Hello,
>
> I tried to make my Perl5 code unicode compliant after reading a post on
> stackoverflow[1].
>
> As suggested in the post:
>
> “always run incoming stuff through NFD and outbound stuff from NFC.”

The same from perlunicode[1]:

“The usual advice is to convert your inputs to NFD before processing
further”

I can't imagine declaring all my static string variables with:

my $unistring = NFD('C’est une chaîne unicode');
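
One possible way to avoid wrapping every literal, at least in tests, is
to normalize both sides at comparison time. The helper name is_nfd below
is hypothetical, not part of Test::More, and this is a sketch rather
than a recommendation:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Test::More;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: compare two strings after bringing both to NFD.
sub is_nfd {
    my ($got, $expected, $name) = @_;

    # Report failures at the caller's line, not inside this helper.
    local $Test::Builder::Level = $Test::Builder::Level + 1;

    return is(NFD($got), NFD($expected), $name);
}

my $input   = "chai\x{302}ne";    # decomposed: i + combining circumflex
my $literal = "cha\x{EE}ne";      # precomposed: U+00EE

is_nfd($input, $literal, 'equal once both sides are in NFD');

done_testing();
```

The trade-off is that the test no longer checks which form the code
under test returns, only that the text is canonically equivalent.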

Regards.

Footnotes: 
[1]  http://perldoc.perl.org/perlunicode.html



Comparing inputs with source strings

2016-05-09 Thread Daniel Dehennin
Hello,

I tried to make my Perl5 code unicode compliant after reading a post on
stackoverflow[1].

As suggested in the post:

“always run incoming stuff through NFD and outbound stuff from NFC.”
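
As a rough sketch of that advice, assuming UTF-8 encoded input and
output, the boundary code could look like this. The sub names text_in
and text_out are made up for the illustration:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Encode qw(decode encode);
use Unicode::Normalize qw(NFD NFC);

# Incoming: decode the bytes to characters, then decompose (NFD).
sub text_in {
    my ($bytes) = @_;
    return NFD(decode('UTF-8', $bytes));
}

# Outgoing: compose (NFC), then encode the characters back to bytes.
sub text_out {
    my ($text) = @_;
    return encode('UTF-8', NFC($text));
}

# UTF-8 bytes for a precomposed E-acute, "t", and a precomposed e-acute.
my $bytes = "\xC3\x89t\xC3\xA9";

my $text = text_in($bytes);    # E, U+0301, t, e, U+0301
my $out  = text_out($text);    # the same five bytes as the input

printf "%d code points internally, %d bytes on output\n",
    length($text), length($out);
```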

I had a hard time finding out why my Test::More test was failing while
displaying exactly the same strings for “got” and “expected”.

I finally checked how UTF-8 sources are handled and found that they are
in NFC form. I ran the following script:

#+begin_src perl
#!/usr/bin/env perl

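use strict;
use warnings;
use utf8;

use Test::More;
use Unicode::Normalize;

my $unistring = 'C’est une chaîne unicode';

# Map each form name to its normalization function.
my %forms = (
    NFD  => \&NFD,
    NFC  => \&NFC,
    NFKD => \&NFKD,
    NFKC => \&NFKC,
);

# Report every form that leaves the literal unchanged.
for my $form (qw(NFD NFC NFKD NFKC)) {
    if ($unistring eq $forms{$form}->($unistring)) {
        print "UTF-8 source is in form '$form'\n";
    }
}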
#+end_src

and got:

#+begin_src
UTF-8 source is in form 'NFC'
UTF-8 source is in form 'NFKC'
#+end_src

So Test::More::is_deeply was comparing an input in NFD with an expected
string in NFC.
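
Because both forms render identically, the difference only shows up when
the code points are listed. A small sketch of such a diagnostic follows;
the helper name codepoints is hypothetical:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use Unicode::Normalize qw(NFD NFC);

# Render a string as a list of U+XXXX code points.
sub codepoints {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $string);
}

my $nfc = NFC("cha\x{EE}ne");    # composed form of the word
my $nfd = NFD($nfc);             # decomposed form of the same text

print "NFC: ", codepoints($nfc), "\n";
print "NFD: ", codepoints($nfd), "\n";
```

The NFC line contains U+00EE where the NFD line contains U+0069 followed
by U+0302, which is exactly the kind of mismatch a visual comparison of
“got” and “expected” hides.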

My code can use Unicode::Collate, but for all the code I did not write I
wonder if there is a way to handle it cleanly.

Or maybe I'm doing something wrong?

Regards.

Footnotes: 
[1]  https://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default
