Hi

On 13/01/17 at 11:34, victory wrote:
>
> first, it is stupid to blame about names which are valid.
> it is also stupid that taking care of each occurrences coming up.
> as pages are all utf-8 now, no need to keep such references,
> this patch restores original characters instead of numeric references
>
> patch below:
> Index: english/international/l10n/scripts/gen-files.pl
> ===================================================================
> --- english/international/l10n/scripts/gen-files.pl (revision 232)
> +++ english/international/l10n/scripts/gen-files.pl (working copy)
> @@ -3,6 +3,7 @@
>  use strict;
>  use File::Path;
>  use Getopt::Long;
> +use Encode qw(encode);
>
>  use lib ($0 =~ m|(.*)/|, $1 or ".") ."/../../../../Perl";
>
> @@ -117,8 +118,7 @@
>      $name =~ s/\s*<.*//;
>      $name =~ s/&(?!#)/&amp;/g;
>      $name =~ s/=\?.*?\?=//g;
> -    # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
> -    $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
> +    $name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
>      $name = 'DDTP' if $name eq 'Debian Description Translation Project';
>      $name = '' if $name =~ m/\@/;
>      return $name;
>
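For reference, a minimal sketch of what the substitution in the quoted patch does to a name that contains numeric character references. The helper names expand_ncr and hex_bytes and the sample inputs are assumptions made up for illustration; they are not part of gen-files.pl. The point it tries to show is that &#130; is expanded into the UTF-8 bytes of U+0082, so the control character ends up in the generated page instead of being dropped:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: applies the same substitution as the proposed patch,
# replacing each decimal reference with the UTF-8 bytes of that code point.
sub expand_ncr {
    my ($name) = @_;
    $name =~ s/&#(\d+);/encode("UTF-8",chr($1))/ge;
    return $name;
}

# Hypothetical helper: renders a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

print hex_bytes(expand_ncr('&#233;')), "\n";     # C3 A9       (U+00E9)
print hex_bytes(expand_ncr('&#130;')), "\n";     # C2 82       (U+0082)
print hex_bytes(expand_ncr('&#128513;')), "\n";  # F0 9F 98 81 (U+1F601)
```

Like the patch, this sketch only matches decimal references; a hexadecimal form such as &#x82; passes through unchanged.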
Thanks for all the work on these and other validation/tidy issues in the
website.

I've done some tests and I'm afraid I cannot merge the patch yet.

Using perl to encode to UTF-8 as you propose makes tidy happy, but there is
another script that is run on the files, "validate", and it produces these
errors:

Line 10, character 12: non SGML character number 130

If we use numeric entities, tidy complains about &#130; unless we suppress the
character as we do now.

For the emoji in a translator name, "validate" complains in any case:

* Using numeric entities (the message currently received):
"128513" is not a character number in the document character set

* Encoding to UTF-8 as in the proposed patch:
Line 10, character 29: non SGML character number 65533

I've produced two small files:

https://cosas.larjona.net/validate.utf8.html
https://cosas.larjona.net/validate.ncr.html

and ran them through the online validator at https://validator.w3.org/

I'll try to see if we can use https://validator.w3.org/source/ and get better
"tidy" and "validate" tools from there.

For now, I've fixed the comment in gen-files.pl:

--- english/international/l10n/scripts/gen-files.pl 20 May 2016 21:15:45 -0000 1.97
+++ english/international/l10n/scripts/gen-files.pl 14 Jan 2017 12:41:06 -0000
@@ -117,7 +117,10 @@
     $name =~ s/\s*<.*//;
     $name =~ s/&(?!#)/&amp;/g;
     $name =~ s/=\?.*?\?=//g;
-    # BREAK PERMITTED HERE (U+0082) is not allowed in HTML 4.01.
+    # BREAK PERMITTED HERE (U+0082) is allowed in HTML 4.01.
+    # but the "tidy" tool that we use complains about them,
+    # so we just remove those characters for now, until better solution
+    # see Bug #820119
     $name =~ s/(?:&#0*130;|&#x0*82;|\N{U+0082})//ig;
     $name = 'DDTP' if $name eq 'Debian Description Translation Project';
     $name = '' if $name =~ m/\@/;

Best regards

-- 
Laura Arjona Reina
https://wiki.debian.org/LauraArjona