Bug#820119: tidy reports valid NCR as invalid

2017-11-07 Thread Neil Roeth
Sorry, I should have used my Debian email address on that last update. 
I'm the maintainer of the opensp package which provides the onsgmls
executable used by validate.

-- 
Neil Roeth



Bug#820119: tidy reports valid NCR as invalid

2017-11-07 Thread Neil Roeth
Laura asked for my help on this issue.  What I found is that setting the
environment variable SP_CHARSET_FIXED to 1 makes the onsgmls program use
the Unicode 2.0 character set, as the referenced web page says.
However, it uses only the first 65536 characters (the iso10646-ucs-2
character set), so character number 128513 (U+1F601) triggers the error
since it is outside that range.  To make that work, SP_CHARSET_FIXED
must be unset in the validate script.  XML files still need
SP_CHARSET_FIXED set, though, so I suggest something like this (patch
attached):

    if ($xhtml{$htmlLevel}) {
        $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
        $ENV{'SP_CHARSET_FIXED'} = 1;
        $ENV{'SP_ENCODING'} = 'xml';
    } else {
        $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
        if (defined $charset) {
            $ENV{'SP_BCTF'} = $charset;
        } else {
            $ENV{'SP_BCTF'} = "utf-8";
        }
    }

That also changes the default character set for HTML from ISO-8859-1 to
UTF-8 because the former is not a valid BCTF option.  It appears the
validate script only uses that default if there is not a character set
defined in the HTML file itself and there is no character set option
passed to the script.

I didn't set up the whole web site build on my machine to test whether
this change has any negative effects on pages other than en_GB.it.html,
so it needs broader testing.
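
To illustrate the range problem described above, here is a minimal Perl
sketch that lists the numeric character references in a string whose
code point lies above 65535, i.e. outside the 65536-character
iso10646-ucs-2 range.  The function name ncrs_outside_ucs2 and the sample
string are made up for this illustration; this is not part of the
validate script or of onsgmls.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Highest code point in a 65536-character set (0..65535).
my $UCS2_MAX = 0xFFFF;

# Hypothetical helper: return the code points of all NCRs in $text that
# are above $UCS2_MAX.  Accepts decimal (&#128513;) and hex (&#x1F601;).
sub ncrs_outside_ucs2 {
    my ($text) = @_;
    my @outside;
    while ($text =~ /&#(?:([0-9]+)|[xX]([0-9A-Fa-f]+));/g) {
        my $code = defined $1 ? $1 + 0 : hex($2);
        push @outside, $code if $code > $UCS2_MAX;
    }
    return @outside;
}

# Sample input (made up): one NCR inside the range, two outside it.
my @found = ncrs_outside_ucs2('caf&#233; &#128513; &#x1F601;');
print "outside UCS-2: @found\n";
```

Both references in the sample denote the same character, 128513, so that
number is reported twice, while &#233; is inside the range and is not
reported.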


diff --git a/scripts/validate b/scripts/validate
index 7d20f1c..a41c1cb 100755
--- a/scripts/validate
+++ b/scripts/validate
@@ -364,16 +364,16 @@ foreach $file (@files) {
 # environment accordingly.
 if ($xhtml{$htmlLevel}) {
 $ENV{'SGML_CATALOG_FILES'} = $xhtmlCatalog;
+	$ENV{'SP_CHARSET_FIXED'} = 1;
 $ENV{'SP_ENCODING'} = 'xml';
 } else {
 $ENV{'SGML_CATALOG_FILES'} = $htmlCatalog;
 if (defined $charset) {
-$ENV{'SP_ENCODING'} = $charset;
+$ENV{'SP_BCTF'} = $charset;
 } else {
-$ENV{'SP_ENCODING'} = "ISO-8859-1";
+$ENV{'SP_BCTF'} = "utf-8";
 }
 }
-$ENV{'SP_CHARSET_FIXED'} = 1;
 
 if ($verbose) {
 if ($file eq '-') {


Bug#820119: tidy reports valid NCR as invalid

2016-04-07 Thread AYANOKOUZI, Ryuunosuke
Control: tag -1 + patch

Dear all,

At Wed, 6 Apr 2016 20:48:15 +0200,
Frank Lichtenheld wrote:
>
> 2016-04-06 18:52 GMT+02:00 victory:
> > On Tue, 5 Apr 2016 20:16:53 +0200
> > Frank Lichtenheld wrote:
> >
> >> I assume you wanted to report this against tidy, not www.debian.org?
> >
> > if so, I always report to the upstream, not the debian's one
> >
> > see https://www-master.debian.org/build-logs/tidy/
> > files w/ 142bytes are caused by the issue
> > (other langs do not have the page [international/l10n/po/pl])
>
> Okay, that paragraph would have been helpful in the original mail to
> understand the context of your statement.

In HTML, "BREAK PERMITTED HERE" (U+0082) followed by "SPACE" can be
replaced by a plain " ", because browsers already allow a line break
between words separated by a space.

How about applying this patch?
It replaces "&#130; " in the translator's name with " ".
The patch is not fully tested, but I hope it will work.
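
The attached patch itself is not reproduced in this archive.  As a rough
sketch of the kind of replacement described above, and assuming the
reference appears as the NCR &#130; (or its hex form) directly followed
by a space, it could look like this in Perl.  The function name
strip_bph_ncr and the sample name are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replace an NCR for U+0082 (BREAK PERMITTED HERE)
# that is followed by a space with just the space.
sub strip_bph_ncr {
    my ($name) = @_;
    $name =~ s/&#(?:130|[xX]0*82);[ ]/ /g;
    return $name;
}

# Made-up translator name, for illustration only.
print strip_bph_ncr('Jan&#130; Kowalski'), "\n";
```

An NCR that is not followed by a space is left untouched, since removing
it there would also remove the only break opportunity.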

Sincerely yours,
Ryuunosuke Ayanokouzi
--
AYANOKOUZI, Ryuunosuke 


fix_NCR_130.patch
Description: Binary data



Bug#820119: tidy reports valid NCR as invalid

2016-04-06 Thread Frank Lichtenheld
2016-04-06 18:52 GMT+02:00 victory:
> On Tue, 5 Apr 2016 20:16:53 +0200
> Frank Lichtenheld wrote:
>
>> I assume you wanted to report this against tidy, not www.debian.org?
>
> if so, I always report to the upstream, not the debian's one
>
> see https://www-master.debian.org/build-logs/tidy/
> files w/ 142bytes are caused by the issue
> (other langs do not have the page [international/l10n/po/pl])

Okay, that paragraph would have been helpful in the original mail to
understand the context of your statement.

Regards,
  Frank

-- 
Frank Lichtenheld 



Bug#820119: tidy reports valid NCR as invalid

2016-04-06 Thread victory
On Tue, 5 Apr 2016 20:16:53 +0200
Frank Lichtenheld wrote:

> I assume you wanted to report this against tidy, not www.debian.org?

if so, I always report to the upstream, not the debian's one

see https://www-master.debian.org/build-logs/tidy/
files w/ 142bytes are caused by the issue
(other langs do not have the page [international/l10n/po/pl])

as this is an issue about managing the site,
you have some choices:
1) fix tidy (upstream or package) and use an appropriate option
2) filter out tidy's output (pipe to sed or use a locally modified tidy)
3) tamper with the po file
4) ignore this and accept the current situation forever
   (until the Last-Translator changes)

-- 
victory
no need to CC me :-)



Bug#820119: tidy reports valid NCR as invalid

2016-04-05 Thread Frank Lichtenheld
2016-04-05 18:12 GMT+02:00 victory:
>
> Package: www.debian.org

I assume you wanted to report this against tidy, not www.debian.org?

> Severity: wishlist
>
> https://www.w3.org/International/questions/qa-controls#support
> HTML, XHTML and XML 1.0 do not support the C0 range,
> except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A,
> and CR (Carriage Return) U+000D.
> The C1 range is supported, i.e. you can encode the controls directly
> or represent them as NCRs (Numeric Character References).
>
> *
> https://www.w3.org/International/questions/qa-controls#background
> The control codes in the range U+0080-U+009F are known as the "C1" range.
>
> unfortunately no tidy option seems to eliminate this :(
> the latest source uses the same code (from line 1165)
> https://github.com/htacg/tidy-html5/blob/master/src/lexer.c
>
>
> --
> victory
> no need to CC me :-)
>


-- 
Frank Lichtenheld 



Bug#820119: tidy reports valid NCR as invalid

2016-04-05 Thread victory

Package: www.debian.org
Severity: wishlist

https://www.w3.org/International/questions/qa-controls#support
HTML, XHTML and XML 1.0 do not support the C0 range, 
except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, 
and CR (Carriage Return) U+000D. 
The C1 range is supported, i.e. you can encode the controls directly 
or represent them as NCRs (Numeric Character References).

*
https://www.w3.org/International/questions/qa-controls#background
The control codes in the range U+0080-U+009F are known as the "C1" range.

unfortunately no tidy option seems to eliminate this :(
the latest source uses the same code (from line 1165)
https://github.com/htacg/tidy-html5/blob/master/src/lexer.c
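
The rule quoted above from the W3C page can be written down as a small
Perl check.  This is only a sketch of the rule as quoted, not tidy's
logic; the function name control_status is made up, and the C0 bounds
U+0000-U+001F are the standard definition rather than something stated
in the quote.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify a code point according to the quoted rule.
sub control_status {
    my ($cp) = @_;
    # HT, LF and CR are the only C0 controls that are supported.
    return 'allowed C0 (HT, LF, CR)'
        if $cp == 0x09 || $cp == 0x0A || $cp == 0x0D;
    # Remaining C0 range, U+0000-U+001F (assumed standard bounds).
    return 'unsupported C0' if $cp >= 0x00 && $cp <= 0x1F;
    # C1 range, U+0080-U+009F, supported directly or as an NCR.
    return 'supported C1' if $cp >= 0x80 && $cp <= 0x9F;
    return 'not a C0/C1 control';
}

printf "U+%04X: %s\n", $_, control_status($_) for (0x09, 0x01, 0x82, 0x41);
```

Under this rule &#130; (U+0082) falls in the C1 range and is therefore a
valid NCR.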


-- 
victory
no need to CC me :-)