Bug#820119: tidy reports valid NCR as invalid
Sorry, I should have used my Debian email address on that last update. I'm the maintainer of the opensp package, which provides the onsgmls executable used by validate.

--
Neil Roeth
Bug#820119: tidy reports valid NCR as invalid
Laura asked for my help on this issue. What I found is that setting the environment variable SP_CHARSET_FIXED to 1 makes the onsgmls program use the Unicode 2.0 character set, as the referenced web page says. However, it uses only the first 65536 characters (the iso10646-ucs-2 character set), so character number 128513 triggers the error since it is outside that range.

In order to make that work, you need to ensure SP_CHARSET_FIXED is unset in the validate script. However, XML files need SP_CHARSET_FIXED set. So, I suggest something like this (patch attached):

    if ($xhtml{$htmlLevel}) {
        $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
        $ENV{'SP_CHARSET_FIXED'} = 1;
        $ENV{'SP_ENCODING'} = 'xml';
    } else {
        $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
        if (defined $charset) {
            $ENV{'SP_BCTF'} = $charset;
        } else {
            $ENV{'SP_BCTF'} = "utf-8";
        }
    }

That also changes the default character set for HTML from ISO-8859-1 to UTF-8 because the former is not a valid BCTF option. It appears the validate script only uses that default if there is not a character set defined in the HTML file itself and there is no character set option passed to the script.

I didn't set up the whole web site build on my machine to test if this change has any negative effects on pages other than en_GB.it.html, so it needs broader testing.

diff --git a/scripts/validate b/scripts/validate
index 7d20f1c..a41c1cb 100755
--- a/scripts/validate
+++ b/scripts/validate
@@ -364,16 +364,16 @@ foreach $file (@files) {
     # environment accordingly.
     if ($xhtml{$htmlLevel}) {
         $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
+        $ENV{'SP_CHARSET_FIXED'} = 1;
         $ENV{'SP_ENCODING'} = 'xml';
     } else {
         $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
         if (defined $charset) {
-            $ENV{'SP_ENCODING'} = $charset;
+            $ENV{'SP_BCTF'} = $charset;
         } else {
-            $ENV{'SP_ENCODING'} = "ISO-8859-1";
+            $ENV{'SP_BCTF'} = "utf-8";
         }
     }
-    $ENV{'SP_CHARSET_FIXED'} = 1;
 
     if ($verbose) {
         if ($file eq '-') {
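To illustrate why character number 128513 falls outside the range described above, here is a minimal Python sketch that scans markup for numeric character references at or above 65536. It is not part of the validate script or of onsgmls; the names `BMP_LIMIT`, `NCR_PATTERN` and `ncrs_outside_bmp` are hypothetical and exist only for this illustration.

```python
import re
from typing import List

# First code point beyond the 65536 characters covered by iso10646-ucs-2.
BMP_LIMIT = 0x10000

# Matches decimal (&#130;) and hexadecimal (&#x82;) numeric character references.
NCR_PATTERN = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def ncrs_outside_bmp(markup: str) -> List[int]:
    """Return the code points of all NCRs in markup that are >= 65536."""
    found = []
    for hex_digits, dec_digits in NCR_PATTERN.findall(markup):
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point >= BMP_LIMIT:
            found.append(code_point)
    return found


if __name__ == "__main__":
    sample = "ok: &#130; too large for UCS-2: &#128513;"
    print(ncrs_outside_bmp(sample))
```

A page for which this returns a non-empty list would presumably hit the same error whenever SP_CHARSET_FIXED is set for HTML input.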
Bug#820119: tidy reports valid NCR as invalid
Control: tag -1 + patch

Dear all,

At Wed, 6 Apr 2016 20:48:15 +0200,
Frank Lichtenheld wrote:
>
> 2016-04-06 18:52 GMT+02:00 victory:
> > On Tue, 5 Apr 2016 20:16:53 +0200
> > Frank Lichtenheld wrote:
> >
> >> I assume you wanted to report this against tidy, not www.debian.org?
> >
> > if so, I always report to the upstream, not the debian's one
> >
> > see https://www-master.debian.org/build-logs/tidy/
> > files w/ 142bytes are caused by the issue
> > (other langs do not have the page [international/l10n/po/pl])
>
> Okay, that paragraph would have been helpful in the original mail to
> understand the contexts of your statement.

In HTML, "BREAK PERMITTED HERE" + "SPACE" can be rewritten as " ", because a browser may already break the line between words separated by a space.

How about applying this patch? It replaces the "BREAK PERMITTED HERE" NCR followed by a space in the translator's name with " ". The patch is not fully tested, but I hope it will work.

Sincerely yours,
Ryuunosuke Ayanokouzi

--
AYANOKOUZI, Ryuunosuke

fix_NCR_130.patch
Description: Binary data
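The attached patch is not shown in the thread, so the following Python sketch only illustrates the kind of substitution described: replacing the "BREAK PERMITTED HERE" reference plus a space with a single space. It assumes the reference is written in decimal as `&#130;` (suggested by the patch file name); the function name and the sample name are hypothetical.

```python
# "BREAK PERMITTED HERE" is U+0082, i.e. decimal 130, followed by a space.
NCR_BPH_PLUS_SPACE = "&#130; "


def drop_break_permitted_here(name: str) -> str:
    """Replace each '&#130; ' sequence with a plain space."""
    return name.replace(NCR_BPH_PLUS_SPACE, " ")


if __name__ == "__main__":
    # Placeholder, not the actual Last-Translator value from the po file.
    print(drop_break_permitted_here("Given&#130; Family"))
```

The visible text stays the same apart from the removed control character, which is why the substitution is expected to be harmless for rendering.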
Bug#820119: tidy reports valid NCR as invalid
2016-04-06 18:52 GMT+02:00 victory:
> On Tue, 5 Apr 2016 20:16:53 +0200
> Frank Lichtenheld wrote:
>
>> I assume you wanted to report this against tidy, not www.debian.org?
>
> if so, I always report to the upstream, not the debian's one
>
> see https://www-master.debian.org/build-logs/tidy/
> files w/ 142bytes are caused by the issue
> (other langs do not have the page [international/l10n/po/pl])

Okay, that paragraph would have been helpful in the original mail to understand the context of your statement.

Regards,
Frank

--
Frank Lichtenheld
Bug#820119: tidy reports valid NCR as invalid
On Tue, 5 Apr 2016 20:16:53 +0200
Frank Lichtenheld wrote:

> I assume you wanted to report this against tidy, not www.debian.org?

if so, I always report to the upstream, not the debian's one

see https://www-master.debian.org/build-logs/tidy/
files w/ 142bytes are caused by the issue
(other langs do not have the page [international/l10n/po/pl])

as this is an issue about managing the site, you have some choices:
1) fix tidy (upstream or package) and use an appropriate option
2) eliminate tidy's output (pipe to sed or use a locally modified tidy)
3) tamper with the po file
4) ignore this and accept the current situation forever
   (until the Last-Translator changes)

--
victory
no need to CC me :-)
Bug#820119: tidy reports valid NCR as invalid
2016-04-05 18:12 GMT+02:00 victory:
>
> Package: www.debian.org

I assume you wanted to report this against tidy, not www.debian.org?

> Severity: wishlist
>
> https://www.w3.org/International/questions/qa-controls#support
> HTML, XHTML and XML 1.0 do not support the C0 range,
> except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A,
> and CR (Carriage Return) U+000D.
> The C1 range is supported, i.e. you can encode the controls directly
> or represent them as NCRs (Numeric Character References).
>
> *
> https://www.w3.org/International/questions/qa-controls#background
> The control codes in the range U+0080-U+009F are known as the "C1" range.
>
> unfortunately no option seems to eliminate this :(
> latest source use the same code (line 1165-)
> https://github.com/htacg/tidy-html5/blob/master/src/lexer.c
>
>
> --
> victory
> no need to CC me :-)
>

--
Frank Lichtenheld
Bug#820119: tidy reports valid NCR as invalid
Package: www.debian.org
Severity: wishlist

https://www.w3.org/International/questions/qa-controls#support
HTML, XHTML and XML 1.0 do not support the C0 range,
except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A,
and CR (Carriage Return) U+000D.
The C1 range is supported, i.e. you can encode the controls directly
or represent them as NCRs (Numeric Character References).

*
https://www.w3.org/International/questions/qa-controls#background
The control codes in the range U+0080-U+009F are known as the "C1" range.

unfortunately no option seems to eliminate this :(
latest source use the same code (line 1165-)
https://github.com/htacg/tidy-html5/blob/master/src/lexer.c

--
victory
no need to CC me :-)
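The rule quoted from the W3C page can be sketched as a small Python classifier. This is only an illustration of the quoted text, not tidy's actual logic in lexer.c. The function name and return strings are hypothetical, and the C0 range U+0000-U+001F is an assumption added here (the quoted text gives only the C1 bounds).

```python
# C0 controls that HTML, XHTML and XML 1.0 do accept: HT, LF, CR.
ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def control_status(code_point: int) -> str:
    """Classify a code point under the C0/C1 rule quoted above."""
    if code_point in ALLOWED_C0:
        return "C0, supported"
    if 0x00 <= code_point <= 0x1F:  # assumed C0 range, see lead-in
        return "C0, not supported"
    if 0x80 <= code_point <= 0x9F:  # C1 range as given in the quoted text
        return "C1, supported"
    return "not a C0/C1 control"


if __name__ == "__main__":
    # 130 is the decimal value of U+0082, which lies in the C1 range.
    for cp in (0x09, 0x01, 130, 0x41):
        print(hex(cp), control_status(cp))
```

Under this reading, a reference such as `&#130;` falls in the supported C1 range, which is the basis for the report that tidy flags a valid NCR.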