Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Paul Hardy
On Sat, Jun 24, 2017 at 8:07 PM, Paul Hardy  wrote:
> On Sat, Jun 24, 2017 at 7:12 PM, Paul Wise  wrote:
>> On Sun, Jun 25, 2017 at 8:54 AM, Simon McVittie wrote:
>>
>>> For what it's worth, I agree that declaring the correct charset in HTTP
>>> metadata is a better solution than prepending U+FEFF ZERO WIDTH NO-BREAK SPACE
>>> (aka the "byte-order mark") in the file content.
>
> Yes, the BOM was only intended for UTF-16, which could actually have
> two different byte orders.  Because there is no such thing as "byte
> order" with UTF-8, the world wide web has rebranded the UTF-8
> three-byte version of U+FEFF as the "UTF-8 signature".  The original
> intention of The Unicode Consortium was that the sequence would never
> be used in a UTF-8 document.

Russ and I and one other person have alluded to this, but I thought
I'd give the exact text from the Unicode 10.0.0 Standard, which was
just released half a week ago.  The quote is on the bottom of page 67.
The Standard is available at

http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

You can see that times (and attitudes towards the BOM) change (I added
the "0x" for hexadecimal; the Standard uses subscript 16):

"Unicode Signature. An initial BOM may also serve as an implicit
marker to identify a file as containing Unicode text. For UTF-16, the
sequence 0xFE 0xFF (or its byte-reversed counterpart, 0xFF 0xFE) is
exceedingly rare at the outset of text files that use other character
encodings. The corresponding UTF-8 BOM sequence, 0xEF 0xBB 0xBF, is
also exceedingly rare. In either case, it is therefore unlikely to be
confused with real text data. The same is true for both single-byte
and multibyte encodings.

"Data streams (or files) that begin with the U+FEFF byte order mark
are likely to contain Unicode characters. It is recommended that
applications sending or receiving untyped data streams of coded
characters use this signature. If other signaling methods are used,
signatures should not be employed."
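
The byte sequences named in the quote can be reproduced by encoding U+FEFF in each encoding form. This is a minimal sketch in Python, which is not used anywhere in this thread and is chosen here only for illustration:

```python
BOM = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE

# Encode the same code point in each encoding form mentioned in the quote.
utf8_signature = BOM.encode("utf-8")
utf16_big_endian = BOM.encode("utf-16-be")
utf16_little_endian = BOM.encode("utf-16-le")

print(utf8_signature.hex(" "))       # ef bb bf
print(utf16_big_endian.hex(" "))     # fe ff
print(utf16_little_endian.hex(" "))  # ff fe
```

UTF-8 has a single byte order, so its three-byte form can only act as a signature; the two UTF-16 forms are what actually distinguish big-endian from little-endian data.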


Paul Hardy



Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Paul Hardy
On Sat, Jun 24, 2017 at 8:13 PM, Russ Allbery  wrote:
>
> That's one of the things that confuses me a bit -- why not just use the
> existing HTML files?  ...
>
> I assume you're looking at:
>
> https://www.debian.org/doc/devel-manuals#policy

I did a StartPage search for "debian upgrade checklist" from past
experience, and a link to the ".txt" file was the very first link that
appeared, but I did not see a link to the corresponding ".html" file.

To my previous list of three options, there is a fourth option, which
I would also be okay with: change nothing and close the bug.  But take
into consideration that ".txt" file is available on Debian via a web
link, so others are likely to see it.


Paul Hardy



Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Paul Wise
On Sat, 2017-06-24 at 20:48 -0700, Russ Allbery wrote:

> Can't we just set the character set for the text files that come from
> Debian Policy?  At least with Apache you can set character sets with
> whatever granularity you want.

Doesn't look like there are any files within the Debian Policy
directory that are non-UTF-8, so that should be doable.

pabs@mirror-anu:/srv/static.debian.org/mirrors/www.debian.org/cur$ find -iname '*.txt' -print0 | xargs -0 isutf8 | grep policy
./doc/manuals/ddp-policy/ddp-policy.en.txt: line 1476, char 1, byte offset 22: invalid UTF-8 code

-- 
bye,
pabs

https://wiki.debian.org/PaulWise




Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Russ Allbery
Paul Wise  writes:
> On Sat, 2017-06-24 at 20:07 -0700, Paul Hardy wrote:

>> 2) Set the HTTP headers for charset="UTF-8"

> FYI, there are 1018 non-UTF-8 out of 2605 total *.txt files on the
> Debian website and 9 non-UTF-8 out of 1102 total *.txt files in the
> Debian archive mirrors. It seems feasible to convert the files in the
> Debian archive to UTF-8 but it doesn't seem to be feasible to do that
> for www.debian.org.

Can't we just set the character set for the text files that come from
Debian Policy?  At least with Apache you can set character sets with
whatever granularity you want.
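
To illustrate what "whatever granularity you want" could mean in practice, here is a small Python sketch of a per-path decision about the Content-Type value. It is not Apache configuration, and the path prefixes are hypothetical; the real layout on www.debian.org may differ.

```python
from typing import Sequence

# Hypothetical prefixes under which every *.txt file is known to be UTF-8.
UTF8_TEXT_PREFIXES = ("/doc/packaging-manuals/", "/doc/debian-policy/")


def content_type_for(path: str,
                     utf8_prefixes: Sequence[str] = UTF8_TEXT_PREFIXES) -> str:
    """Return the Content-Type header value to send for a plain-text path."""
    is_text = path.endswith(".txt")
    is_known_utf8 = any(path.startswith(prefix) for prefix in utf8_prefixes)
    if is_text and is_known_utf8:
        return "text/plain; charset=utf-8"
    # Leave other text files undeclared, since some are not UTF-8.
    return "text/plain"


print(content_type_for("/doc/packaging-manuals/upgrading-checklist.txt"))
print(content_type_for("/some/old/dedication.txt"))
```

Restricting the declaration to directories whose contents are guaranteed to be UTF-8 avoids mislabelling the older files that use other encodings.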

-- 
Russ Allbery (r...@debian.org)   



Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Paul Wise
On Sat, 2017-06-24 at 20:07 -0700, Paul Hardy wrote:

> Three possibilities seem to exist, and I am fine with any one being chosen:
> 
> 1) Use the UTF-8 signature in UTF-8 text files

If this triggers browsers to use the right encoding, it seems
reasonable to add it in the situation where the files could be served
by any web server on the Internet. Right now all the mirrors of
www.debian.org are on Debian-controlled servers though, but there are
many non-UTF-8 text files so using the UTF-8 signature seems better.
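
As a rough sketch of option 1, the following Python snippet prepends the UTF-8 signature to a byte string and shows how a reader can strip it again. The function name is made up for illustration; nothing here reflects how Debian Policy actually generates its text files.

```python
import codecs


def add_utf8_signature(data: bytes) -> bytes:
    """Prepend the UTF-8 signature unless the data already starts with it."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


text = "Copyright \u00a9 Debian\n"
signed = add_utf8_signature(text.encode("utf-8"))

# The "utf-8-sig" codec drops a leading signature when decoding,
# whereas plain "utf-8" keeps it as a U+FEFF character.
decoded_without_signature = signed.decode("utf-8-sig")
decoded_with_signature = signed.decode("utf-8")

print(signed[:3].hex(" "))
print(decoded_without_signature == text)
```

The second decode shows the cost of this option: any consumer that does not expect the signature sees an extra invisible character at the start of the file.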

> 2) Set the HTTP headers for charset="UTF-8"

FYI, there are 1018 non-UTF-8 out of 2605 total *.txt files on the
Debian website and 9 non-UTF-8 out of 1102 total *.txt files in the
Debian archive mirrors. It seems feasible to convert the files in the
Debian archive to UTF-8 but it doesn't seem to be feasible to do that
for www.debian.org.

pabs@mirror-anu:/srv/static.debian.org/mirrors/www.debian.org/cur$ find -iname '*.txt' | wc -l
2605
pabs@mirror-anu:/srv/static.debian.org/mirrors/www.debian.org/cur$ find -iname *.txt -print0 | xargs -0 isutf8  | wc -l
1018
pabs@mirror-anu:/srv/mirrors/debian$ find -iname '*.txt' | wc -l
1102
pabs@mirror-anu:/srv/mirrors/debian$ find -iname '*.txt' -print0 | xargs -0 isutf8  | wc -l
9
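
The counts above come from isutf8, which, judging by the transcript, prints one line per file that fails validation. A roughly equivalent check can be sketched in Python; the function names are illustrative, and this is not how isutf8 itself is implemented.

```python
import os
from typing import List


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_text_files(root: str) -> List[str]:
    """List *.txt files (case-insensitive) under root that are not valid UTF-8."""
    invalid = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                if not is_valid_utf8(handle.read()):
                    invalid.append(path)
    return sorted(invalid)
```

Calling `len(non_utf8_text_files("."))` in a mirror directory would give a figure comparable to the `wc -l` results, assuming the same set of files is visited.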

> 3) Convert UTF-8 text files to HTML documents for web display

Sounds like this is already done.

-- 
bye,
pabs

https://wiki.debian.org/PaulWise




Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Russ Allbery
Paul Hardy  writes:

> If using the UTF-8 signature in a document is too aesthetically
> distasteful (and I don't disagree), and if setting the HTTP header to
> denote a UTF-8 charset is not a universal solution because it will only
> have effect on Debian's servers, would a tool that converted such text
> files to an HTML document be desirable?  Such a hypothetical tool would
> insert a meta tag in the header declaring a UTF-8 charset.

That's one of the things that confuses me a bit -- why not just use the
existing HTML files?  All the text files in Debian Policy (except for
virtual-package-names-list.txt, which doesn't have any UTF-8) are
generated by rendering the HTML to text.  The HTML is there alongside, and
if you're looking at things in a web browser, I'd think they'd be
preferable.

I assume you're looking at:

https://www.debian.org/doc/devel-manuals#policy

All of the primary links are to HTML.

I'm certainly happy to try to get the text versions correct, but I would
expect most people would prefer to use the HTML versions.

-- 
Russ Allbery (r...@debian.org)   



Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Paul Hardy
On Sat, Jun 24, 2017 at 7:12 PM, Paul Wise  wrote:
> On Sun, Jun 25, 2017 at 8:54 AM, Simon McVittie wrote:
>
>> For what it's worth, I agree that declaring the correct charset in HTTP
>> metadata is a better solution than prepending U+FEFF ZERO WIDTH NO-BREAK SPACE
>> (aka the "byte-order mark") in the file content.

Yes, the BOM was only intended for UTF-16, which could actually have
two different byte orders.  Because there is no such thing as "byte
order" with UTF-8, the world wide web has rebranded the UTF-8
three-byte version of U+FEFF as the "UTF-8 signature".  The original
intention of The Unicode Consortium was that the sequence would never
be used in a UTF-8 document.


In Firefox, if you press Ctrl+Shift+Q you will get an "Inspector".  Loading
https://www.debian.org/doc/packaging-manuals/upgrading-checklist.txt
with the Network tab selected in the Inspector shows multiple tabs.
The "Console" tab gives this message, highlighted in pink [for
dramatic effect]:

"The character encoding of the plain text document was not declared.
The document will render with garbled text in some browser
configurations if the document contains characters from outside the
US-ASCII range. The character encoding of the file needs to be
declared in the transfer protocol or file needs to use a byte order
mark as an encoding signature."

So the browser is encouraging the use of this three-byte UTF-8 version
of U+FEFF, even though it was never supposed to be used in a document.
We live in an imperfect world.


Going to the Network tab, reloading the page, and clicking on "Raw
Headers" shows the following information (I just made the request
again):

Request Headers:
Host: www.debian.org
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
If-Modified-Since: Sat, 24 Jun 2017 20:17:13 GMT
If-None-Match: "e965-552ba67456626-gzip"
Cache-Control: max-age=0

Response Headers:
Accept-Ranges: bytes
Cache-Control: max-age=86400
Connection: Keep-Alive
Content-Encoding: gzip
Content-Length: 18592
Content-Type: text/plain
Date: Sun, 25 Jun 2017 02:10:24 GMT
Etag: "e965-552ba67456626-gzip"
Expires: Mon, 26 Jun 2017 02:10:24 GMT
Keep-Alive: timeout=5, max=100
Last-Modified: Sat, 24 Jun 2017 20:17:13 GMT
Server: Apache
Strict-Transport-Security: max-age=15552000
Vary: Accept-Encoding
X-Clacks-Overhead: GNU Terry Pratchett
X-Content-Type-Options: nosniff
X-Frame-Options: sameorigin
X-XSS-Protection: 1
referrer-policy: no-referrer

So the Content-Type is "text/plain" with no charset parameter, which
results in the "garbled characters", to quote the Firefox Console window
in the Inspector.  As an aside, the Content-Encoding is "gzip", which is
a good thing.
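
A short Python sketch of how the garbling arises: the same UTF-8 bytes are decoded with a different single-byte encoding. The choice of windows-1252 as the fallback is an assumption; the encoding a browser actually picks depends on its version and locale settings.

```python
copyright_sign = "\u00a9"  # U+00A9 COPYRIGHT SIGN
utf8_bytes = copyright_sign.encode("utf-8")

# Assumption: with no declared charset, the browser falls back to windows-1252.
garbled = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex(" "))  # c2 a9
print(garbled)              # two characters instead of one
```

Each byte of the two-byte UTF-8 sequence is rendered as its own character, which is why a single non-ASCII symbol shows up as a pair of unrelated ones.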


On Sat, Jun 24, 2017 at 7:12 PM, Paul Wise  wrote:
> Forcing every text file to UTF-8 isn't the correct solution either,
> since it breaks text files that are not encoded in UTF-8 (such as old
> dedication texts) and does not work on Debian mirrors that are not
> controlled by us.

If using the UTF-8 signature in a document is too aesthetically
distasteful (and I don't disagree), and if setting the HTTP header to
denote a UTF-8 charset is not a universal solution because it will
only have effect on Debian's servers, would a tool that converted such
text files to an HTML document be desirable?  Such a hypothetical tool
would insert a meta tag in the header declaring a UTF-8 charset.

If that is an acceptable solution, I could put together an awk script
for Debian (if it would get used) that would employ awk's BEGIN and
END sections to wrap a UTF-8 document in HTML tags, enclosing the text
itself in ... tags.  That would mean that Debian UTF-8
documents intended for being served on the web would have to run such
a utility and be converted into HTML pages for display.
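
The message proposes awk; the same idea can be sketched in Python as follows. The exact tags named in the original message did not survive in this archive, so the `<meta charset="utf-8">` and `<pre>` elements used here are assumptions, as are the function name and the default title.

```python
import html


def wrap_text_as_html(text: str, title: str = "Document") -> str:
    """Wrap plain text in a minimal HTML page that declares UTF-8."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n<pre>\n"
        # Escape &, < and > so the text is shown literally, not parsed as markup.
        f"{html.escape(text)}"
        "</pre>\n</body>\n</html>\n"
    )


page = wrap_text_as_html("if a < b && c > d:\n    pass\n", title="Example")
print(page)
```

The fixed header and footer strings play the role that awk's BEGIN and END sections would play in the proposed script.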

Three possibilities seem to exist, and I am fine with any one being chosen:

1) Use the UTF-8 signature in UTF-8 text files
2) Set the HTTP headers for charset="UTF-8"
3) Convert UTF-8 text files to HTML documents for web display


Paul Hardy



Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Paul Wise
On Sun, Jun 25, 2017 at 8:54 AM, Simon McVittie wrote:

> For what it's worth, I agree that declaring the correct charset in HTTP
> metadata is a better solution than prepending U+FEFF ZERO WIDTH NO-BREAK SPACE
> (aka the "byte-order mark") in the file content.

Forcing every text file to UTF-8 isn't the correct solution either,
since it breaks text files that are not encoded in UTF-8 (such as old
dedication texts) and does not work on Debian mirrors that are not
controlled by us.

-- 
bye,
pabs

https://wiki.debian.org/PaulWise



Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Simon McVittie
On Sat, 24 Jun 2017 at 15:04:41 -0700, Russ Allbery wrote:
> Stéphane Blondon  writes:
> > pabs added such a configuration a few days ago in the Apache configuration:
> > https://anonscm.debian.org/cgit/mirror/dsa-puppet.git/commit/?id=5bcf8431d6b375d211a29f9d2c338e4400332e1a
> 
> Paul, does this resolve your original issue?  I'm a bit worried that it
> might not because this was a little bit ago, and I think it should have
> already been in place when you were testing.

That configuration change was for ftp.debian.org, not for www. A
similar change to apache-www.debian.org.erb would probably do the right
thing.

http://ftp.debian.org/debian/doc/bug-log-access.txt does look correct
to me (with a Unicode copyright symbol near the end) in Firefox 52.2.0.
Compare:

$ curl --silent -o /dev/null --dump-header - \
  http://ftp.debian.org/debian/doc/bug-log-access.txt \
  | grep Content-Type
X-Content-Type-Options: nosniff
Content-Type: text/plain; charset=utf-8

$ curl --silent -o /dev/null --dump-header - \
  https://www.debian.org/doc/packaging-manuals/upgrading-checklist.txt \
  | grep -i Content-Type
X-Content-Type-Options: nosniff
Content-Type: text/plain
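
The difference between the two responses can also be checked programmatically. This is a small Python sketch that extracts the charset parameter from a Content-Type value using the standard library's email header parsing; the helper name is made up for illustration.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


print(declared_charset("text/plain; charset=utf-8"))  # utf-8
print(declared_charset("text/plain"))                 # None
```

A result of None corresponds to the www.debian.org response, where the browser is left to guess the encoding.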

For what it's worth, I agree that declaring the correct charset in HTTP
metadata is a better solution than prepending U+FEFF ZERO WIDTH NO-BREAK SPACE
(aka the "byte-order mark") in the file content.

Regards,
S



Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Russ Allbery
Stéphane Blondon  writes:

> pabs added such a configuration a few days ago in the Apache configuration:
> https://anonscm.debian.org/cgit/mirror/dsa-puppet.git/commit/?id=5bcf8431d6b375d211a29f9d2c338e4400332e1a

> The reason was that UTF-8 encoded txt files displayed badly in some browsers.
> The start of this thread:
> https://lists.debian.org/debian-www/2017/06/msg00068.html

Perfect, thank you!

Paul, does this resolve your original issue?  I'm a bit worried that it
might not because this was a little bit ago, and I think it should have
already been in place when you were testing.

-- 
Russ Allbery (r...@debian.org)   



Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Stéphane Blondon
Le 24/06/2017 à 20:44, Russ Allbery a écrit :
> debian-www folks, is there a way to declare UTF-8 as the charset for all
> the *.txt files that originate from the debian-policy package and are
> served by www.debian.org?  I can guarantee that all the text files shipped
> as part of the Policy package will be valid UTF-8.
> 

pabs added such a configuration a few days ago in the Apache configuration:
https://anonscm.debian.org/cgit/mirror/dsa-puppet.git/commit/?id=5bcf8431d6b375d211a29f9d2c338e4400332e1a


The reason was that UTF-8 encoded txt files displayed badly in some browsers.
The start of this thread:
https://lists.debian.org/debian-www/2017/06/msg00068.html

-- 
Stéphane





Bug#865713: Declaring a charset of UTF-8 for policy files

2017-06-24 Thread Russ Allbery
debian-www, not debian-web.

Colin Watson  writes:
> On Fri, Jun 23, 2017 at 11:49:20PM -0700, Russ Allbery wrote:

>> I'm still a bit dubious about this, since I don't believe editors and
>> generators normally add it, but given how we generate the text versions
>> of the documents, it's relatively easy to add a leading BOM and seems
>> harmless.  I'll take a look.

> I share the discomfort in your previous message with using the UTF-8
> BOM.  I'd have thought that a better approach here would be to fix this
> at the HTTP layer:
> https://www.debian.org/doc/packaging-manuals/upgrading-checklist.txt
> (and other text files here) should return "Content-Type: text/plain;
> charset=UTF-8", not just "Content-Type: text/plain".

debian-www folks, is there a way to declare UTF-8 as the charset for all
the *.txt files that originate from the debian-policy package and are
served by www.debian.org?  I can guarantee that all the text files shipped
as part of the Policy package will be valid UTF-8.

-- 
Russ Allbery (r...@debian.org)   



Bug#865713: Declaring a charset of UTF-8 for policy files (was: Re: Bug#865713: Please Start UTF-8 debian-policy Text Files with UTF-8 Signature)

2017-06-24 Thread Russ Allbery
Colin Watson  writes:
> On Fri, Jun 23, 2017 at 11:49:20PM -0700, Russ Allbery wrote:

>> I'm still a bit dubious about this, since I don't believe editors and
>> generators normally add it, but given how we generate the text versions
>> of the documents, it's relatively easy to add a leading BOM and seems
>> harmless.  I'll take a look.

> I share the discomfort in your previous message with using the UTF-8
> BOM.  I'd have thought that a better approach here would be to fix this
> at the HTTP layer:
> https://www.debian.org/doc/packaging-manuals/upgrading-checklist.txt
> (and other text files here) should return "Content-Type: text/plain;
> charset=UTF-8", not just "Content-Type: text/plain".

debian-web folks, is there a way to declare UTF-8 as the charset for all
the *.txt files that originate from the debian-policy package and are
served by www.debian.org?  I can guarantee that all the text files shipped
as part of the Policy package will be valid UTF-8.

-- 
Russ Allbery (r...@debian.org)