Re: UTF-8 in jessie

2013-10-14 Thread Johannes Schauer
Hi,

Quoting Adam Borowski (2013-08-12 02:51:52)
 On Mon, May 06, 2013 at 02:49:57PM +0200, Andreas Beckmann wrote:
  now might be the right time to start a discussion about release goals
  for jessie.
 
 I would like to propose full UTF-8 support.  I don't mean here full
 support for all of Unicode's finer points, merely complete eradication of
 mojibake.  That is, ensuring that /m.o/ matches "möo", or that "ä" sorts
 as equal to "a" + combining "¨" is out of scope of this proposal.
 
 I propose the following sub-goals:
 
 1. all programs should, in their default configuration, accept UTF-8 input
    and pass it through uncorrupted.  Having to manually specify encoding
    is acceptable only in a programmatic interface, GUI/std{in,out,err}/
    command line/plain files should work with nothing but LC_CTYPE.

As an addendum to this release goal proposal, it may also be worth adding
working multibyte character support in coreutils as a possible goal.

From http://bugs.debian.org/139861 :

$ echo -e "日\n本\nで\nは" | sort -u | wc -l
3
$ echo -e "日\n本\nで\nは" | sort | wc -l
4

Having head/tail work on characters instead of bytes would be
sweet as well.
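
To make the byte-versus-character distinction concrete, here is a minimal
sketch in Python rather than coreutils.  head_bytes and head_chars are
hypothetical helper names invented for this illustration; they are not
existing tools or coreutils options.

```python
# Hypothetical helpers illustrating byte-based vs character-based "head".
def head_bytes(data: bytes, n: int) -> bytes:
    """Keep the first n bytes, as a byte-oriented tool would."""
    return data[:n]


def head_chars(data: bytes, n: int, encoding: str = "utf-8") -> bytes:
    """Keep the first n characters (code points) of encoded text."""
    return data.decode(encoding)[:n].encode(encoding)


text = "日本では".encode("utf-8")   # 4 characters, 3 bytes each

# Cutting after 4 bytes splits the second character in the middle,
# so the result is no longer valid UTF-8.
print(head_bytes(text, 4))

# Cutting after 2 characters keeps whole characters only.
print(head_chars(text, 2))
```

The character-based variant has to decode the input, which is why it needs
to know the encoding (here assumed to be UTF-8) while the byte-based one
does not.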

While upstream doesn't seem to support this, Fedora appears to have a patch
for coreutils:

http://pkgs.fedoraproject.org/cgit/coreutils.git/tree/coreutils-i18n.patch?id=6e10f376996b64f538259091a524df2249b653fb;id2=HEAD

or also:

http://trac.cross-lfs.org/browser/patches/coreutils-6.12-unicode-1.patch?rev=577dd2d59133e10bd32c58844293e93af0e6f162

cheers, josch


--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20131014105058.7934.26083@hoothoot



Re: UTF-8 in jessie

2013-09-17 Thread Jakub Wilk

* Adam Borowski kilob...@angband.pl, 2013-08-12, 02:51:

 Detecting non-UTF files is easy:
 * false positives are impossible
 * false negatives are extremely unlikely: combinations of letters that would
   happen to match a valid utf character don't happen naturally, and even if
   they did, every single combination in the file tested would need to match
   valid utf.


Not quite. While 7-bit encodings different than ASCII are all endangered 
species, some of them can still be seen in the wild, and they excellently 
disguise themselves as UTF-8. (We had to add special code to detect ISO-2022 
encodings to Lintian not that long ago.)
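
A small sketch of why such encodings slip past a plain UTF-8 check,
assuming ISO-2022-JP as the example.  This is not how Lintian implements
its check; looks_like_iso2022_jp is a hypothetical heuristic written for
this illustration, and the escape list covers only the four sequences
defined in RFC 1468.

```python
# Escape sequences that switch character sets in ISO-2022-JP (RFC 1468).
ISO2022_JP_ESCAPES = (b"\x1b$@", b"\x1b$B", b"\x1b(J", b"\x1b(B")


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def looks_like_iso2022_jp(data: bytes) -> bool:
    """Heuristic: pure 7-bit data containing an ISO-2022-JP escape."""
    return data.isascii() and any(esc in data for esc in ISO2022_JP_ESCAPES)


sample = "日本".encode("iso2022_jp")
print(sample)                         # only 7-bit bytes
print(is_valid_utf8(sample))          # True: the naive check is fooled
print(looks_like_iso2022_jp(sample))  # True: the escapes give it away
```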


Anyway, if you want to help UTF-8-ize the world, you could start by providing 
patches for these bugs:

http://lintian.debian.org/tags/debian-changelog-file-uses-obsolete-national-encoding.html
http://lintian.debian.org/tags/debian-copyright-file-uses-obsolete-national-encoding.html
http://lintian.debian.org/tags/doc-base-file-uses-obsolete-national-encoding.html

--
Jakub Wilk


Archive: http://lists.debian.org/20130917161026.ga6...@jwilk.net



Re: UTF-8 in jessie

2013-08-29 Thread Ian Jackson
Adam Borowski writes (Re: UTF-8 in jessie):
 Let's take a look at some sheets.

Last time I looked at this I found a copy of the actual ASCII
standards document from 1968 or so and it did mention this usage.

  I don't think that better UTF-8 support should involve needlessly
  converting 7-bit ASCII text files which use ` ' as matched quotes,
  into UTF-8 text files which use non-ISO-646 codepoints.
 
 These code points are defined to be exactly the same in both ASCII and
 Unicode.  Only fonts may differ.  And like Han unification issues, this
 is out of scope here.

Do you intend that text files containing uses of ` ' as matched single
quotes should be changed to use non-7-bit BMP matched single quotes ?
It seems that you don't.

In which case I'm afraid you will have to make this explicit somehow
in your proposal.  Otherwise zealous people will go around complaining
about funny-looking quotes and changing a whole bunch of text files to
no longer be 7-bit.

See GCC's error messages, for a case in point.

Ian.


Archive: 
http://lists.debian.org/21023.14041.902602.551...@chiark.greenend.org.uk



Re: UTF-8 in jessie

2013-08-29 Thread Jonas Smedegaard
Quoting Ian Jackson (2013-08-29 13:56:09)
 Adam Borowski writes (Re: UTF-8 in jessie):
  Let's take a look at some sheets.
 
 Last time I looked at this I found a copy of the actual ASCII 
 standards document from 1968 or so and it did mention this usage.
 
   I don't think that better UTF-8 support should involve needlessly 
   converting 7-bit ASCII text files which use ` ' as matched quotes, 
   into UTF-8 text files which use non-ISO-646 codepoints.
  
  These code points are defined to be exactly the same in both ASCII 
  and Unicode.  Only fonts may differ.  And like Han unification 
  issues, this is out of scope here.
 
 Do you intend that text files containing uses of ` ' as matched single 
 quotes should be changed to use non-7-bit BMP matched single quotes ? 
 It seems that you don't.
 
 In which case I'm afraid you will have to make this explicit somehow 
 in your proposal.  Otherwise zealous people will go around complaining 
 about funny-looking quotes and changing a whole bunch of text files to 
 no longer be 7-bit.
 
 See GCC's error messages, for a case in point.

I believe the underlying issue is the one summarized here: 
https://en.wikipedia.org/wiki/Typewriter_apostrophe#ASCII_encoding

If that is correct, then the issue here is not whether ASCII ` equals 
UTF-8 ' (or some similar recoding), but instead that _authors_ from an 
era of looking at output representing ' as ` grew a habit of typing back 
into documents that other character.

How about we simply mention explicitly that `arcane quoting' - even if 
arguably related to UTF-8 encoding, should be classified not as 
release-critical bugs but as spelling errors.


 - Jonas

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private




Re: UTF-8 in jessie

2013-08-29 Thread Ian Jackson
Jonas Smedegaard writes (Re: UTF-8 in jessie):
 I believe the underlying issue is the one summarized here: 
 https://en.wikipedia.org/wiki/Typewriter_apostrophe#ASCII_encoding

Yes.

 How about we simply mention explicitly that `arcane quoting' - even if 
 arguably related to UTF-8 encoding, should be classified not as 
 release-critical bugs but as spelling errors.

I don't think it is a bug.  What I'm trying to forestall is a campaign
to change documents which use ` '.

Ian.


Archive: 
http://lists.debian.org/21023.28874.354634.158...@chiark.greenend.org.uk



Re: UTF-8 in jessie

2013-08-29 Thread Jonas Smedegaard
Quoting Ian Jackson (2013-08-29 18:03:22)
 Jonas Smedegaard writes (Re: UTF-8 in jessie):
  I believe the underlying issue is the one summarized here: 
  https://en.wikipedia.org/wiki/Typewriter_apostrophe#ASCII_encoding
 
 Yes.
 
  How about we simply mention explicitly that `arcane quoting' - even 
  if arguably related to UTF-8 encoding, should be classified not as 
  release-critical bugs but as spelling errors.
 
 I don't think it is a bug.  What I'm trying to forestall is a campaign 
 to change documents which use ` '.

My aim was the same, but I see how severity minor still implies "IT'S 
A BUG!!".

How about this, then:

  Although arguably related to UTF-8 encoding, `arcane quoting' is not 
  part of this release-goal.


 - Jonas

-- 
 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private




Re: UTF-8 in jessie

2013-08-29 Thread Ian Jackson
Jonas Smedegaard writes (Re: UTF-8 in jessie):
 Quoting Ian Jackson (2013-08-29 18:03:22)
  Jonas Smedegaard writes (Re: UTF-8 in jessie):
   I believe the underlying issue is the one summarized here: 
   https://en.wikipedia.org/wiki/Typewriter_apostrophe#ASCII_encoding
...
 My aim was the same, but I see how severity minor still implies "IT'S 
 A BUG!!".
 
 How about this, then:
 
   Although arguably related to UTF-8 encoding, `arcane quoting' is not 
   part of this release-goal.

I'm happy with "... is not part of this release goal".  I'm not sure
that everyone will know what `arcane quoting' means.

How about

  Changing any 7-bit characters (for example, changing the way we deal
  with `'-quoting) is not part of this release-goal.

Ian.


Archive: 
http://lists.debian.org/21023.31276.110369.358...@chiark.greenend.org.uk



Re: UTF-8 in jessie

2013-08-29 Thread Russ Allbery
Ian Jackson ijack...@chiark.greenend.org.uk writes:
 Jonas Smedegaard writes (Re: UTF-8 in jessie):

  How about we simply mention explicitly that `arcane quoting' - even if
  arguably related to UTF-8 encoding, should be classified not as
  release-critical bugs but as spelling errors.

 I don't think it is a bug.  What I'm trying to forestall is a campaign
 to change documents which use ` '.

Well, personally I think that's ugly, and if it bothers me sufficiently I
might file a minor bug about it, but I certainly agree that it's not a
suitable subject for a mass bug filing.  And surely that's obvious?  I
wouldn't have even thought of the relationship of backtick in ASCII to
this proposal if you hadn't mentioned it, and it seems unlikely to me that
we're going to have many people who think this is what it means.

-- 
Russ Allbery (r...@debian.org)   http://www.eyrie.org/~eagle/


Archive: http://lists.debian.org/87r4dcflqa@windlord.stanford.edu



Re: UTF-8 in jessie

2013-08-28 Thread Ian Jackson
Adam Borowski writes (UTF-8 in jessie):
 I would like to propose full UTF-8 support.  I don't mean here full
 support for all of Unicode's finer points, merely complete eradication of
 mojibake.  That is, ensuring that /m.o/ matches "möo", or that "ä" sorts
 as equal to "a" + combining "¨" is out of scope of this proposal.

I agree with everything you propose except that I have one reservation
regarding this:

 4. all text files should be encoded in UTF-8

I agree with this except that I think it should be permitted that a
text file uses ASCII codepoints.

You may say "but UTF-8 is a superset of ASCII".  Well, no, it isn't.
UTF-8 is a superset of ISO-646 but ISO-646 is not identical to ASCII.
In particular the descriptions of the codepoints ` ' in ISO-646
effectively forbid them from being used as matching single quotes,
despite that being specified as allowed in ASCII.

I don't think that better UTF-8 support should involve needlessly
converting 7-bit ASCII text files which use ` ' as matched quotes,
into UTF-8 text files which use non-ISO-646 codepoints.

(In fact I would like to see Markus Kuhn's decision about ` ' reversed
- our default character set should be ASCII for 0..127 plus UTF for
the rest.  That's not an argument I expect to win but at the very
least we shouldn't have to worsify things for ASCII users.)

Thanks,
Ian.


Archive: http://lists.debian.org/21022.5425.511942.342...@chiark.greenend.org.uk



Re: UTF-8 in jessie

2013-08-28 Thread Dmitrijs Ledkovs
On 12 August 2013 01:51, Adam Borowski kilob...@angband.pl wrote:

 3. all file names must be valid UTF-8


Case in point: errors from the Ubuntu UDD package importer:

Packages containing non-UTF-8, non-ASCII filenames. This is a problem.
It is unclear how to sensibly map these into Bazaar.

anon-proxy aspell-is aspell-pt aspell-ro cvsnt dacco egroupware ewiki
firebird2 fortunes-pl fslint glest-data gmoo ii-esu ispell-fo jpilot
kdeedu liblingua-de-ascii-perl magyarispell mtink ooohg openverse
phpgedview phpgroupware projectl qcad qdvdauthor tatan tuxpaint uae
xblast-tnt xblast-tnt-levels


Not sure how up to date this list is (it could be a historic package
version that has non-UTF-8/non-ASCII filenames).  Test & file bugs?
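
One way such a test could look, as a rough sketch only: walk an unpacked
package tree and report names that do not decode as UTF-8.  The function
name invalid_utf8_names and the idea of running it on an unpacked tree are
assumptions made for this example; this is neither the UDD importer's nor
Lintian's code.

```python
import os


def invalid_utf8_names(root: bytes):
    """Yield paths under root whose file or directory name is not valid UTF-8."""
    # Passing bytes to os.walk makes it return raw, undecoded byte names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                yield os.path.join(dirpath, name)
```

Pure ASCII names pass automatically, since every ASCII byte string is also
valid UTF-8.  A caller would pass a bytes path, for example
invalid_utf8_names(b"unpacked-package").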

Regards,

Dmitrijs.


Archive: 
http://lists.debian.org/canbhlui_ch-z7y+ga4f2ui81if2gjog-cxccmfzuzumpzjf...@mail.gmail.com



Re: UTF-8 in jessie

2013-08-28 Thread Adam Borowski
On Wed, Aug 28, 2013 at 04:20:17PM +0100, Ian Jackson wrote:
 Adam Borowski writes (UTF-8 in jessie):
  I would like to propose full UTF-8 support.  I don't mean here full
  support for all of Unicode's finer points, merely complete eradication of
  mojibake.  That is, ensuring that /m.o/ matches "möo", or that "ä" sorts
  as equal to "a" + combining "¨" is out of scope of this proposal.
 
 I agree with everything you propose except that I have one reservation
 regarding this:
 
  4. all text files should be encoded in UTF-8
 
 I agree with this except that I think it should be permitted that a
 text file uses ASCII codepoints.
 
 You may say "but UTF-8 is a superset of ASCII".  Well, no, it isn't.

Uhm, how?

 UTF-8 is a superset of ISO-646 but ISO-646 is not identical to ASCII.
 In particular the descriptions of the codepoints ` ' in ISO-646
 effectively forbid them from being used as matching single quotes,
 despite that being specified as allowed in ASCII.

Let's take a look at some sheets.

Feb 1972:
https://en.wikipedia.org/wiki/File:ASCII_Code_Chart-Quick_ref_card.jpg

1967/68:
http://www.samhallas.co.uk/repository/telegraph/teletype_33_specs.pdf

` and ' don't look like anything resembling matching quotes to me.
Usually ' is vertical or slightly slanted, ` tends to be at 45 degrees
or quite close to horizontal.

 I don't think that better UTF-8 support should involve needlessly
 converting 7-bit ASCII text files which use ` ' as matched quotes,
 into UTF-8 text files which use non-ISO-646 codepoints.

These code points are defined to be exactly the same in both ASCII and
Unicode.  Only fonts may differ.  And like Han unification issues, this
is out of scope here.

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


Archive: http://lists.debian.org/20130828180712.ga8...@angband.pl



Re: UTF-8 in jessie

2013-08-18 Thread gregor herrmann
On Mon, 12 Aug 2013 02:51:52 +0200, Adam Borowski wrote:

 4a. perl and pod
 
 Considering perl to be text raises one more issue: pod.  By perl's design,
 pod without a specified encoding is considered to be ISO-8859-1, even if
 the file contains "use utf8;".  This is surprising, and many authors use
 UTF-8 like everywhere else, leading to obvious results ("man gdm3" for one
 example).  Thus, there should be a tool (preferably the one mentioned
 above) that checks perl files for pod with undeclared encoding, and raises
 alarm if the file contains any bytes with high bit set.  If a conversion
 encoding is specified, such a declaration could be added automatically.

This tool exists, and it's called perl 5.18 :)
(Ok, more like Pod::Simple or pod2man or whatever that now errors out
on non-ASCII-chars in pod without =encoding.)
 
The results can be seen at
http://bugs.debian.org/cgi-bin/pkgreport.cgi?tag=perl-5.18-transition;users=debian-p...@lists.debian.org
(roughly half of them are the "POD errors" bugs).
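
For readers who want to check a file without running the pod tools, the
condition described in the quoted proposal can be approximated as below.
This is a simplified sketch, not what Pod::Simple does internally:
pod_needs_encoding is a hypothetical name, and like the proposal it looks
at high-bit bytes anywhere in the file rather than only inside pod blocks.

```python
import re

# Any pod command at the start of a line, e.g. "=head1 NAME" or "=pod".
POD_COMMAND = re.compile(rb"^=[a-zA-Z]", re.MULTILINE)
# An explicit declaration such as "=encoding UTF-8".
POD_ENCODING = re.compile(rb"^=encoding\s+\S+", re.MULTILINE)


def pod_needs_encoding(source: bytes) -> bool:
    """True if the file has pod, no =encoding line, and non-ASCII bytes."""
    has_pod = POD_COMMAND.search(source) is not None
    declared = POD_ENCODING.search(source) is not None
    return has_pod and not declared and not source.isascii()
```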


Cheers,
gregor

-- 
 .''`.  Homepage: http://info.comodo.priv.at/ - OpenPGP key 0xBB3A68018649AA06
 : :' : Debian GNU/Linux user, admin, and developer  -  http://www.debian.org/
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   NP: Tom Waits: Come On Up To The House




Re: UTF-8 in jessie

2013-08-13 Thread Christian PERRIER
Quoting Charles Plessy (ple...@debian.org):

 About display by GUIs, I think that we should have a system to install all the
 fonts necessary to display languages that we support at the installation.


Such as tasksel and its language tasks? :-)

In short, we already have that. However, we need people to maintain
that, namely to decide what fonts should be installed when a given
language is chosen at install time.

New translators contributing to D-I are usually asked to do this and,
therefore, most languages that are not Latin-something based will
trigger the installation of a font package that is suitable for them.

I also try to move the maintenance of such font packages under the
(large) umbrella of the pkg-fonts maintenance team, as the maintenance
of font packages is usually very loose.





Re: UTF-8 in jessie

2013-08-13 Thread Charles Plessy
Le Tue, Aug 13, 2013 at 08:12:24AM +0200, Christian PERRIER a écrit :
 Quoting Charles Plessy (ple...@debian.org):
 
  About display by GUIs, I think that we should have a system to install all
  the fonts necessary to display languages that we support at the installation.
 
 
 Such as tasksel and its language tasks? :-)
 
 In short, we already have that. However, we need people to maintain
 that, namely to decide what fonts should be installed when a given
 language is chosen at install time.

Hi Christian,

what I am proposing is a task that installs all languages.  I did a bit of
research earlier, and it is not as simple as installing all the existing tasks,
as the result on my computer was that some browsers started to display Japanese
texts with simplified Chinese glyphs.

http://bugs.debian.org/702050

Unfortunately, I did not get an answer.  Feedback is very welcome.

Cheers,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan


Archive: http://lists.debian.org/20130813063123.ge6...@falafel.plessy.net



Re: UTF-8 in jessie

2013-08-13 Thread Christian PERRIER
Quoting Charles Plessy (ple...@debian.org):

 Hi Christian,
 
 what I am proposing is a task that installs all languages.  I did a bit of
 research earlier, and it is not as simple as installing all the existing
 tasks, as the result on my computer was that some browsers started to
 display Japanese texts with simplified Chinese glyphs.
 
 http://bugs.debian.org/702050
 
 Unfortunately, I did not get an answer.  Feedback is very welcome.


It's quite likely the result of broken (or wrong) fontconfig files in
some of the installed fonts. For instance, fonts-arphic-u{kai|ming}
currently spit out warnings from their fontconfig files.






Re: UTF-8 in jessie

2013-08-13 Thread Bastien ROUCARIES
On Mon, Aug 12, 2013 at 5:56 PM, Thorsten Glaser t...@mirbsd.de wrote:
 Florian Lohoff f at zz.de writes:

 5. All programs consuming UTF8 Text must understand a BOM.

 The kernel doesn’t, start there:

 tglase@tglase:~$ mksh -c 'print '\''\ufeff#!/bin/sh\necho foo'\' >x; chmod +x x; ./x
 ./x: line 1: #!/bin/sh: No such file or directory
 foo

 That’s running GNU bash, with bash as /bin/sh for testing, which deviates
 from my normal setup of running mksh… because I fixed mksh to support this
 (and the MirBSD kernel, too).

There was the utf8script package
http://packages.qa.debian.org/u/utf8script.html but it was orphaned and
removed (O/RM) some time ago.  Time to resurrect it?
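
The failure quoted above comes down to the first bytes of the file: the
kernel only treats a file as a script when it starts with the two bytes
"#!", and a UTF-8 BOM puts three other bytes in front of them.  A minimal
sketch, with shebang_visible and strip_bom as hypothetical helper names
(this is not what utf8script did):

```python
import codecs


def shebang_visible(script: bytes) -> bool:
    """True if the file starts with the two bytes "#!"."""
    return script[:2] == b"#!"


def strip_bom(script: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF), if present."""
    if script.startswith(codecs.BOM_UTF8):
        return script[len(codecs.BOM_UTF8):]
    return script


with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\necho foo\n"
print(shebang_visible(with_bom))             # False
print(shebang_visible(strip_bom(with_bom)))  # True
```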


 I disagree with requiring ASCII for $PATH though…

 bye,
 //mirabilos


 Archive: http://lists.debian.org/loom.20130812t175549-...@post.gmane.org



Archive: 
http://lists.debian.org/CAE2SPAa6jMkh=q2h0c0bbpnfz8k-teuptwok1e32regi9hs...@mail.gmail.com



Re: UTF-8 in jessie

2013-08-13 Thread Thorsten Glaser
Vincent Lefevre vincent at vinc17.net writes:

 If scripts intend to use LC_ALL=C.UTF-8 to force everything to
 the standard locale with UTF-8 support, then the glibc should
 be modified to regard C.UTF-8 like C w.r.t. $LANGUAGE. I mean:

Ouch! Scripts do, and this *is* how C.UTF-8 was intended: to
behave like C/POSIX except for the encoding.

 Both should have output in English.

Please reportbug that.
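
Until that is fixed, a script that wants C.UTF-8 with untranslated messages
can clear LANGUAGE itself.  The sketch below assumes that dropping LANGUAGE
is an acceptable workaround; c_utf8_env is a hypothetical helper name, not
part of any library.

```python
import os


def c_utf8_env(base=None):
    """Return a copy of the environment forcing C.UTF-8 without translations."""
    env = dict(os.environ if base is None else base)
    env["LC_ALL"] = "C.UTF-8"
    # Assumed workaround: remove LANGUAGE so it cannot select a translation.
    env.pop("LANGUAGE", None)
    return env
```

The result can be passed as the env argument of subprocess.run when
calling external programs whose output the script has to parse.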

Thanks,
//mirabilos


Archive: http://lists.debian.org/loom.20130813t122454-...@post.gmane.org



Re: UTF-8 in jessie

2013-08-13 Thread Vincent Lefevre
On 2013-08-13 10:25:31 +, Thorsten Glaser wrote:
 Vincent Lefevre vincent at vinc17.net writes:
 
  If scripts intend to use LC_ALL=C.UTF-8 to force everything to
  the standard locale with UTF-8 support, then the glibc should
  be modified to regard C.UTF-8 like C w.r.t. $LANGUAGE. I mean:
 
 Ouch! Scripts do, and this *is* how C.UTF-8 was intended: to
 behave like C/POSIX except for the encoding.
 
  Both should have output in English.
 
 Please reportbug that.

OK: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=719590

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


Archive: http://lists.debian.org/20130813120507.ga18...@xvii.vinc17.org



Re: UTF-8 in jessie

2013-08-13 Thread Florian Lohoff
On Mon, Aug 12, 2013 at 05:58:20PM +0200, Adam Borowski wrote:
 On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
  5. All programs consuming UTF8 Text must understand a BOM.
 
 I'm afraid I don't agree here: BOMs are nasty stuff that serve no purpose
 once you standardize on UTF8.  They might help with exchange with a minority
 of Windows programs, at a cost at our side.  Windows hardly does plain text:
 most of that is MSVC/etc sources, but then, the C/C++ standards explicitly
 forbid junk in places other than comments.  Most other languages expect a
 hashbang on Unix, which makes BOMs impossible.

I agree that BOMs are nasty and should not be generated by our standard
tools. 

I have been bitten by BOMs more than once and had a hard time looking
for the fault until I looked at the plain ASCII file with a hex editor.
AFAIK tools like vim understand BOMs, hide the fact that one is there,
and write it back out.

Other tools give interesting results when they stumble on a BOM.

So it's inconsistent, which makes the problem hard to find.
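
The hex-editor step can be scripted.  The following is only a sketch of
that manual check; describe_start is a hypothetical helper name, and it
needs Python 3.8 or later for the separator argument of bytes.hex().

```python
import codecs


def describe_start(data: bytes, count: int = 8) -> str:
    """Hex-dump the first bytes, flagging a UTF-8 BOM if present."""
    prefix = data[:count].hex(" ")
    if data.startswith(codecs.BOM_UTF8):
        return prefix + "  <- starts with UTF-8 BOM (ef bb bf)"
    return prefix


print(describe_start(codecs.BOM_UTF8 + b"key=value\n"))
print(describe_start(b"key=value\n"))
```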

Flo
-- 
Florian Lohoff f...@zz.de




Re: UTF-8 in jessie (debhelper and BOM)

2013-08-13 Thread Adam Borowski
On Tue, Aug 13, 2013 at 01:44:03PM +0900, Osamu Aoki wrote:
 But I do not understand goal #5.  Why MUST?  Do you have rationale?
 
 On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
  On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
   I propose the following sub-goals:
 ...
   4. all text files should be encoded in UTF-8
 
 Yes.  But it will be nice to have some support by dh_installdocs :-)
   ^^
 
  5. All programs consuming UTF8 Text must understand a BOM.
   
 
 I agree as SHOULD but should we state MUST? 

Please note the number of '>' markers.

It is not part of my proposal, and the discussion in this thread, me
included, seems to be pretty hostile to adding BOMs.

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


Archive: http://lists.debian.org/20130813224409.gc3...@angband.pl



Re: UTF-8 in jessie

2013-08-12 Thread Niels Thykier
On 2013-08-12 04:18, Charles Plessy wrote:
 Le Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski a écrit :

 I would like to propose full UTF-8 support.  I don't mean here full
 support for all of Unicode's finer points, merely complete eradication of
 mojibake.
 
 Hi Adam,
 

Hi,

 this is a great goal.  Here are two comments.
 
 There is a related issue opened on the Policy 
 (http://bugs.debian.org/701081),
 where we propose the following:
 
  - Require UTF-8 for the names of all files and directories installed by 
 binary packages.

For the record, there is a Lintian tag for this now[1], which suggests
only a handful of packages violate this.

  - Recommend ASCII when possible.
  - Require ASCII for files in /bin, /sbin, /usr/bin, /usr/sbin and /usr/games.
 

Requiring ASCII for files in $PATH should be trivial to implement as a
separate tag.  I suppose the ASCII requirement could also be implemented
as a pedantic check or so.  Regardless, patches welcome.  :)
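
As an illustration of how small such a check is, here is a sketch in Python
that scans an installed system.  It is not a Lintian check (Lintian
inspects package contents, not the running system), and non_ascii_commands
is a hypothetical name; the directory list is taken from the Policy
proposal quoted above.

```python
import os

# Directories named in the Policy proposal quoted above.
BIN_DIRS = ("/bin", "/sbin", "/usr/bin", "/usr/sbin", "/usr/games")


def non_ascii_commands(dirs=BIN_DIRS):
    """Return paths in the given directories whose file name is not pure ASCII."""
    hits = []
    for d in dirs:
        if not os.path.isdir(d):
            continue  # e.g. /usr/games may not exist
        for name in os.listdir(d):
            if not name.isascii():
                hits.append(os.path.join(d, name))
    return sorted(hits)
```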

 About display by GUIs, I think that we should have a system to install all the
 fonts necessary to display languages that we support at the installation.
 
 Have a nice Debconf !
 

~Niels

[1] http://lintian.debian.org/tags/file-name-is-not-valid-UTF-8.html


Archive: http://lists.debian.org/520895a6.1090...@thykier.net



Re: UTF-8 in jessie

2013-08-12 Thread Vincent Lefevre
On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
 Detecting non-UTF files is easy:
 * false positives are impossible
 * false negatives are extremely unlikely: combinations of letters that would
   happen to match a valid utf character don't happen naturally, and even if
   they did, every single combination in the file tested would need to match
   valid utf.

Not that unlikely, and it is rather annoying that Firefox (and
therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620.
IMHO, in case of ambiguity, UTF-8 should always be preferred by
default (applications could have options to change the preferences).

Bug reports:
  https://bugzilla.mozilla.org/show_bug.cgi?id=760050
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=719481

 On the other hand, detecting text files is hard.

Deciding whether a file is a text file may be hard even for a human.
What about text files with ANSI control sequences?

 The best tool so far, file, makes so many errors it's useless for
 this purpose.

Yes.

 One could use location: like, declaring stuff in /etc/ and
 /usr/share/doc/ to be text unless proven otherwise, but that's an
 incomplete hack. Only hashbangs can be considered reliable, but
 scripts are not where most documentation goes.
 
 Also, should HTML be considered text or not?  Updating http-equiv is not
 rocket surgery, detecting HTML with fancy extensions can be.

I think better questions could be: why do you want to regard a file as
text? For what purpose(s)? For the "all shipped text files in UTF-8"
rule only?

What about examples whose purpose is to have a file in a charset
different from UTF-8?

 4a. perl and pod
 
 Considering perl to be text raises one more issue: pod.  By perl's design,
 pod without a specified encoding is considered to be ISO-8859-1, even if
 the file contains "use utf8;".  This is surprising, and many authors use
 UTF-8 like everywhere else, leading to obvious results ("man gdm3" for one
 example).  Thus, there should be a tool (preferably the one mentioned
 above) that checks perl files for pod with undeclared encoding, and raises
 alarm if the file contains any bytes with high bit set.  If a conversion
 encoding is specified, such a declaration could be added automatically.

Yes, undeclared encoding when not ASCII should be regarded as a bug.

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


Archive: http://lists.debian.org/20130812105035.ga28...@xvii.vinc17.org



Re: UTF-8 in jessie

2013-08-12 Thread Adam Borowski
On Mon, Aug 12, 2013 at 12:50:35PM +0200, Vincent Lefevre wrote:
 On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
  Detecting non-UTF files is easy:
  * false positives are impossible
  * false negatives are extremely unlikely: combinations of letters that would
    happen to match a valid utf character don't happen naturally, and even if
    they did, every single combination in the file tested would need to match
    valid utf.
 
 Not that unlikely, and it is rather annoying that Firefox (and
 therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620.
 IMHO, in case of ambiguity, UTF-8 should always be preferred by
 default (applications could have options to change the preferences).

That's the opposite of what I'm talking about: it is hard to reliably detect
ancient encodings, because they tend to assign a character to every possible
bit stream.  On the other hand, only certain combinations of bytes with the
8th bit set are valid UTF-8, and thus it is possible to detect UTF-8 with
good accuracy.  It is obviously trivial to fool such detection deliberately,
but such combinations don't happen in real languages, and thus if something
validates as UTF-8, it is safe to assume it indeed is.
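
The asymmetry can be seen in a few lines of Python.  This is only a sketch
of the validation idea, using ISO-8859-1 as the sample legacy encoding;
is_valid_utf8 is a helper name made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


word = "café"
print(is_valid_utf8(word.encode("utf-8")))       # True
print(is_valid_utf8(word.encode("iso-8859-1")))  # False

# ISO-8859-1 assigns a character to every byte value, so decoding with it
# never fails and cannot be used to detect anything.
print(len(bytes(range(256)).decode("iso-8859-1")))  # 256
```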
 
  On the other hand, detecting text files is hard.
 
 Deciding whether a file is a text file may be hard even for a human.
 What about text files with ANSI control sequences?

Same as, say, a Word97 document: not text for my purposes.  It might be
just coloured plain text, but there is no generic way to handle that.
Binary formats go more into subgoal 1 of my proposal: arbitrary Unicode
input that matches your syntax should be accepted, and go out uncorrupted
(not the same as unmodified).
 
  One could use location: like, declaring stuff in /etc/ and
  /usr/share/doc/ to be text unless proven otherwise, but that's an
  incomplete hack. Only hashbangs can be considered reliable, but
  scripts are not where most documentation goes.
  
  Also, should HTML be considered text or not?  Updating http-equiv is not
  rocket surgery, detecting HTML with fancy extensions can be.
 
 I think better questions could be: why do you want to regard a file as
 text? For what purpose(s)? For the "all shipped text files in UTF-8"
 rule only?

A shipped config file will have some settings the user may edit and comments
he may read.  Being able to see what's going on is a prerequisite here.

A perl/python/etc script is something our kind of folks often edit and/or
read.

A plain text file ships no encoding information, thus it can be neither
rendered nor edited comfortably if the encoding is different from the system
one.

HTML can include http-equiv which takes care of rendering, but editing is
still a problem.  And if you edit it, or, say, fill in some fields from a
database, you risk data loss.  If everything is UTF-8 end-to-end, this risk
goes away.  (I do care about plain text more, though.)
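
A short sketch of the two failure modes when encodings get mixed along the
way, using ISO-8859-1 as the assumed legacy encoding and a name from this
thread as sample data:

```python
original = "Lefèvre"

# Written as UTF-8, then read by something that assumes ISO-8859-1:
# every non-ASCII character turns into two wrong ones (mojibake).
garbled = original.encode("utf-8").decode("iso-8859-1")
print(ascii(garbled))

# Written as ISO-8859-1, then read as UTF-8 with replacement characters:
# the original letter cannot be recovered from the result.
lossy = original.encode("iso-8859-1").decode("utf-8", errors="replace")
print(ascii(lossy))
```

The first case is reversible as long as nobody edits the garbled text; the
second one loses data as soon as the result is saved.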
 
 What about examples whose purpose is to have a file in a charset
 different from UTF-8?

Well, we don't convert those :)

I don't expect a package with a test suite that includes charset stuff to
make such an error by itself, but if there's a need, we could add a syntax
for exclusions.  For example, writing "verbatim" in the charset field.

  4a. perl and pod
  
  Considering perl to be text raises one more issue: pod.  By perl's design,
  pod without a specified encoding is considered to be ISO-8859-1, even if
  the file contains use utf8;.  This is surprising, and many authors use
  UTF-8 like everywhere else, leading to obvious results (man gdm3 for one
  example).  Thus, there should be a tool (preferably the one mentioned
  above) that checks perl files for pod with undeclared encoding, and raises
  alarm if the file contains any bytes with high bit set.  If a conversion
  encoding is specified, such a declaration could be added automatically.
 
 Yes, undeclared encoding when not ASCII should be regarded as a bug.

And if it's declared but not UTF-8, I'd convert it at package build time.
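
A check along those lines could be sketched roughly as follows.  This is
only an illustration in Python: the function name pod_encoding_problem and
the two regular expressions are assumptions of this sketch, not part of any
existing tool, and it looks at the whole file rather than only at the pod
sections.

```python
import re

# Hypothetical patterns for this sketch: any pod directive, and =encoding.
POD_START = re.compile(rb"^=[a-zA-Z]\w*", re.MULTILINE)
POD_ENCODING = re.compile(rb"^=encoding\s+(\S+)", re.MULTILINE)


def pod_encoding_problem(data):
    """True if data has pod, no =encoding line, and a byte with the high bit set."""
    if not POD_START.search(data):
        return False                      # no pod at all
    if POD_ENCODING.search(data):
        return False                      # an encoding is declared
    return any(byte >= 0x80 for byte in data)


print(pod_encoding_problem(b"=head1 NAME\n\nm\xc3\xb6o\n"))                   # True
print(pod_encoding_problem(b"=encoding utf8\n\n=head1 NAME\n\nm\xc3\xb6o\n"))  # False
print(pod_encoding_problem(b"=head1 NAME\n\nmoo\n"))                          # False
```

Whether a declared but non-UTF-8 encoding should then be converted is a
separate step that this sketch does not attempt.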

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


Archive: http://lists.debian.org/20130812131659.ga21...@angband.pl



Re: UTF-8 in jessie

2013-08-12 Thread Adam Borowski
On Mon, Aug 12, 2013 at 09:58:30AM +0200, Niels Thykier wrote:
 For the record, there is a Lintian tag for this now[1], which suggests
 only a handful of packages violates this.
 
   - Recommend ASCII when possible.
   - Require ASCII for files in /bin, /sbin, /usr/bin, /usr/sbin and 
  /usr/games.
 
 Requiring ASCII for files in $PATH should be trivial to implement as a
 separate tag.  I suppose the ASCII requirement could also be implemented
 as a pedantic check or so.  Regardless, patches welcome.  :)

I disagree here: I'd want to remove any need for that recommendation
instead.  You might have a point about files in $PATH, though.
 
  About display by GUIs, I think that we should have a system to install all
  the fonts necessary to display languages that we support at the installation.

Could be good, yeah.  At least something basic for every valid Unicode
character.

On the other hand, for me at least CJK doesn't functionally differ from
mojibake.  Which can lead to problems: at the debconf 11 CW party, the best
stuff came in a bottle marked only in Japanese, and thus I'll have to find
which one it was the hard way today :p

Jokes aside, enough of Unicode consists of line drawing, symbols and images
like U+1F4A9 PILE OF POO[2] that are readable by everyone with appropriate
fonts that we might as well just go for 100% coverage by default.  Disk
space is cheap even on the weakest of today's phones, while complex packaging
with moving parts has a serious maintenance cost.


 [1] http://lintian.debian.org/tags/file-name-is-not-valid-UTF-8.html

[2] apt-get install ttf-ancient-fonts
Yeah, aptly named.

-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


Archive: http://lists.debian.org/20130812135503.ga24...@angband.pl



Re: UTF-8 in jessie

2013-08-12 Thread Florian Lohoff
On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
 I propose the following sub-goals:
 
 1. all programs should, in their default configuration, accept UTF-8 input
and pass it through uncorrupted.  Having to manually specify encoding
is acceptable only in a programmatic interface, GUI/std{in,out,err}/
command line/plain files should work with nothing but LC_CTYPE.
 
 2. all GUI/curses/etc programs should be able to display UTF-8 output where
appropriate
 
 3. all file names must be valid UTF-8
 
 4. all text files should be encoded in UTF-8

5. All programs consuming UTF-8 text must understand a BOM.

Flo
-- 
Florian Lohoff f...@zz.de




Re: UTF-8 in jessie

2013-08-12 Thread Thorsten Glaser
Florian Lohoff f at zz.de writes:

 5. All programs consuming UTF-8 text must understand a BOM.

The kernel doesn’t, start there:

tglase@tglase:~$ mksh -c 'print '\''\ufeff#!/bin/sh\necho foo'\' >x; chmod +x x; ./x
./x: line 1: #!/bin/sh: No such file or directory
foo

That’s running GNU bash, with bash as /bin/sh for testing, which deviates
from my normal setup of running mksh… because I fixed mksh to support this
(and the MirBSD kernel, too).


I disagree with requiring ASCII for $PATH though…

bye,
//mirabilos


Archive: http://lists.debian.org/loom.20130812t175549-...@post.gmane.org



Re: UTF-8 in jessie

2013-08-12 Thread Adam Borowski
On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
 5. All programs consuming UTF-8 text must understand a BOM.

I'm afraid I don't agree here: BOMs are nasty stuff that serve no purpose
once you standardize on UTF-8.  They might help with exchange with a minority
of Windows programs, at a cost on our side.  Windows hardly does plain text:
most of that is MSVC/etc sources, but then, the C/C++ standards explicitly
forbid junk in places other than comments.  Most other languages expect a
hashbang on Unix, which makes BOMs impossible.

Other reasons:
* concatenating files adds a misplaced BOM
* taking stuff from the middle loses them
* tools like grep, patch, etc pick and insert lots of individual lines
* tools that don't care about encodings would need to learn about them
* files that appear the same will have a different hash due to presence or
  absence of an invisible character that can appear/disappear with no
  explicit request on the user's part
* with UTF-8, we're 95% there.  For BOMs, there's almost no support.

So I'm strongly against producing BOMs.  As for accepting them, there's
little that can break so it would be mostly ok... but certainly not as
a must clause.
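
Two of the points above (the hash difference and the misplaced BOM after
concatenation) are easy to reproduce.  A minimal sketch in Python; the
sample contents are made up for illustration, and "utf-8-sig" is the
standard codec that drops one leading BOM on decode:

```python
import hashlib

BOM = b"\xef\xbb\xbf"            # U+FEFF encoded as UTF-8

plain = "foo\n".encode("utf-8")
with_bom = BOM + plain

# The two files show the same text but are different byte strings.
same_text = plain.decode("utf-8") == with_bom.decode("utf-8-sig")
same_hash = hashlib.sha256(plain).digest() == hashlib.sha256(with_bom).digest()

# Byte-wise concatenation (what cat does) leaves the second BOM mid-file.
joined = with_bom + with_bom
stray = joined.decode("utf-8-sig").count("\ufeff")

print(same_text, same_hash, stray)  # True False 1
```

The stray U+FEFF in the joined data is invisible in most viewers, which is
what makes this kind of breakage hard to spot.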


-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


Archive: http://lists.debian.org/20130812155820.ga31...@angband.pl



Re: UTF-8 in jessie

2013-08-12 Thread Vincent Lefevre
On 2013-08-12 15:16:59 +0200, Adam Borowski wrote:
 On Mon, Aug 12, 2013 at 12:50:35PM +0200, Vincent Lefevre wrote:
  On 2013-08-12 02:51:52 +0200, Adam Borowski wrote:
   Detecting non-UTF files is easy:
   * false positives are impossible
   * false negatives are extremely unlikely: combinations of letters that
     would happen to match a valid utf character don't happen naturally,
     and even if they did, every single combination in the file tested
     would need to match valid utf.
  
  Not that unlikely, and it is rather annoying that Firefox (and
  therefore Iceweasel) gets this wrong due to an ambiguity with TIS-620.
  IMHO, in case of ambiguity, UTF-8 should always be preferred by
  default (applications could have options to change the preferences).
 
 That's the opposite of what I'm talking about: it is hard to reliably detect
 ancient encodings, because they tend to assign a character to every possible
 bit stream.  On the other hand, only certain combinations of bytes with the
 8th bit set are valid UTF-8, and thus it is possible to detect UTF-8 with
 good accuracy.  It is obviously trivial to fool such detection deliberately,
 but such combinations don't happen in real languages, and thus if something
 validates as UTF-8, it is safe to assume it indeed is.

I don't know exactly what makes Firefox recognize some files as TIS-620
instead of UTF-8, but it is fooled, and not deliberately.
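
Both sides of this can be illustrated with a strict decoder.  A minimal
sketch in Python, where looks_like_utf8 is a made-up helper name: validating
as UTF-8 rejects typical legacy-encoded text, while the same valid UTF-8
bytes also decode without error under TIS-620, so a detector that tries the
legacy charset first has nothing to stop it.

```python
def looks_like_utf8(data):
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8 = "möo".encode("utf-8")          # b'm\xc3\xb6o'
latin1 = "möo".encode("iso-8859-1")   # b'm\xf6o'

print(looks_like_utf8(utf8), looks_like_utf8(latin1))  # True False

# The UTF-8 bytes are also acceptable TIS-620: each high byte simply maps
# to one Thai character, so this decode raises no error.
as_thai = utf8.decode("tis_620")
print(len(as_thai))  # 4
```

This only shows that the ambiguity exists; it says nothing about how
Firefox actually picks between the candidates.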

   On the other hand, detecting text files is hard.
  
  Deciding whether a file is a text file may be hard even for a human.
  What about text files with ANSI control sequences?
 
 Same as, say, a Word97 document: not text for my purposes.  It might be
 just coloured plain text, but there is no generic way to handle that.

I think I've already seen such files as distributed text files
(documentation), or perhaps there were just backspace characters
to get bold (x\bx) and underline (x\b_). The less utility can
handle them.
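
For reference, the overstrike convention can be undone mechanically.  A
minimal Python sketch, assuming only the two patterns mentioned above (a
character struck over itself for bold, a character paired with "_" for
underline); the helper names are made up:

```python
import re

# One character, a backspace, then another character.
OVERSTRIKE = re.compile(r"(.)\x08(.)")


def _pick(match):
    first, second = match.group(1), match.group(2)
    # Underline pairs a character with "_"; bold repeats the same character.
    return second if first == "_" else first


def strip_overstrike(text):
    """Collapse backspace overstrike sequences into plain characters."""
    previous = None
    while previous != text:          # repeat to handle x\bx\bx
        previous = text
        text = OVERSTRIKE.sub(_pick, text)
    return text


print(strip_overstrike("b\bbo\bol\bld\bd and _\bu_\bl"))  # bold and ul
```

After such a pass the remaining bytes can be validated as UTF-8 like any
other text file.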

  I think better questions could be: why do you want to regard a file as
  text? For what purpose(s)? For the "all shipped text files in UTF-8"
  rule only?
 
 A shipped config file will have some settings the user may edit and comments
 he may read.  Being able to see what's going on is a prerequisite here.

However some config files may be byte-oriented (like procmailrc, AFAIK).

 HTML can include http-equiv which takes care of rendering, but editing is
 still a problem.  And if you edit it, or, say, fill in some fields from a
 database, you risk data loss.  If everything is UTF-8 end-to-end, this risk
 goes away.  (I do care about plain text more, though.)

You may still have NFC/NFD problems (this is also true for filenames).
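
To illustrate the NFC/NFD point with Python's standard unicodedata module
(the sample character is chosen only for the example): both forms are valid
UTF-8 and render the same, yet they compare unequal until normalized.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "ä")   # one code point, U+00E4
nfd = unicodedata.normalize("NFD", "ä")   # "a" followed by U+0308

print(nfc == nfd)                                           # False
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))   # 2 3
print(unicodedata.normalize("NFC", nfd) == nfc)             # True
```

For file names this means two entries that look identical can coexist in
one directory on a byte-oriented filesystem.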

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


Archive: http://lists.debian.org/20130812172807.ga2...@ioooi.vinc17.net



Re: UTF-8 in jessie

2013-08-12 Thread Dmitrijs Ledkovs
On 12 August 2013 01:51, Adam Borowski kilob...@angband.pl wrote:
 On Mon, May 06, 2013 at 02:49:57PM +0200, Andreas Beckmann wrote:
 I propose the following sub-goals:

 1. all programs should, in their default configuration, accept UTF-8 input
and pass it through uncorrupted.  Having to manually specify encoding
is acceptable only in a programmatic interface, GUI/std{in,out,err}/
command line/plain files should work with nothing but LC_CTYPE.

 2. all GUI/curses/etc programs should be able to display UTF-8 output where
appropriate

 3. all file names must be valid UTF-8

 4. all text files should be encoded in UTF-8


What about locales though?

* C.utf8 locale should be always available
* C.utf8 locale should be the default/fallback locale
* utf8 locale variants should be default / available / preferred
(where appropriate)

(this is a rough idea, adjust the above as appropriate and feasible at this
point in time)

Regards,

Dmitrijs.


Archive: http://lists.debian.org/canbhluhz9rezyipz1ze5zrptb5zccrspmdf9aoaygqyjvk1...@mail.gmail.com



Re: UTF-8 in jessie

2013-08-12 Thread Vincent Lefevre
On 2013-08-12 17:58:20 +0200, Adam Borowski wrote:
 On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
  5. All programs consuming UTF-8 text must understand a BOM.
 
 I'm afraid I don't agree here: BOMs are nasty stuff that serve no purpose
 once you standardize on UTF-8.  They might help with exchange with a minority
 of Windows programs, at a cost on our side.  Windows hardly does plain text:
 most of that is MSVC/etc sources, but then, the C/C++ standards explicitly
 forbid junk in places other than comments.  Most other languages expect a
 hashbang on Unix, which makes BOMs impossible.

I think that the BOM has more drawbacks than advantages. It could
be useful only if there were an API to handle it correctly and
transparently, and if the current APIs (open(), fopen(), etc.)
were no longer used. Basically this means that one would need a
new OS. This would also mean that a BOM could be seen as some
kind of metadata used by the new API, and having the charset in
the metadata would actually make the BOM completely useless.

 Other reasons:
 * concatenating files adds a misplaced BOM
 * taking stuff from the middle loses them
 * tools like grep, patch, etc pick and insert lots of individual lines
 * tools that don't care about encodings would need to learn about them
 * files that appear the same will have a different hash due to presence or
   absence of an invisible character that can appear/disappear with no
   explicit request on the user's part
 * with UTF-8, we're 95% there.  For BOMs, there's almost no support.

This would also affect regexp, e.g. ^foo on the first line of a file.
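
A small Python illustration of the regexp case; the file contents are made
up.  Decoding with plain "utf-8" keeps the BOM as the first character, so
an anchored pattern no longer matches, while the standard "utf-8-sig" codec
drops one leading BOM:

```python
import re

raw = b"\xef\xbb\xbffoo\nbar\n"

naive = raw.decode("utf-8")          # keeps U+FEFF as the first character
tolerant = raw.decode("utf-8-sig")   # drops one leading BOM, if present

print(bool(re.match(r"^foo", naive)))     # False
print(bool(re.match(r"^foo", tolerant)))  # True
```

Tools that work on raw bytes and never decode at all have no such
convenient place to hide the difference.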

 So I'm strongly against producing BOMs.  As for accepting them, there's
 little that can break so it would be mostly ok... but certainly not as
 a must clause.

Agreed.

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


Archive: http://lists.debian.org/20130812214212.ga22...@xvii.vinc17.org



Re: UTF-8 in jessie

2013-08-12 Thread Vincent Lefevre
On 2013-08-12 20:14:30 +0100, Dmitrijs Ledkovs wrote:
 What about locales though?
 
 * C.utf8 locale should be always available
 * C.utf8 locale should be the default/fallback locale
 * utf8 locale variants should be default / available / preferred
 (where appropriate)

If scripts intend to use LC_ALL=C.UTF-8 to force everything to
the standard locale with UTF-8 support, then glibc should
be modified to regard C.UTF-8 like C w.r.t. $LANGUAGE. I mean:

xvii% LANGUAGE=fr_FR LC_ALL=C.UTF-8 cp
cp: opérande de fichier manquant
Saisissez « cp --help » pour plus d'informations.
xvii% LANGUAGE=fr_FR LC_ALL=C cp  
cp: missing file operand
Try 'cp --help' for more information.

Both should have output in English.

-- 
Vincent Lefèvre vinc...@vinc17.net - Web: http://www.vinc17.net/
100% accessible validated (X)HTML - Blog: http://www.vinc17.net/blog/
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


Archive: http://lists.debian.org/20130812215017.gb22...@xvii.vinc17.org



Re: UTF-8 in jessie

2013-08-12 Thread Charles Plessy
Le Mon, Aug 12, 2013 at 03:55:03PM +0200, Adam Borowski a écrit :
 On Mon, Aug 12, 2013 at 09:58:30AM +0200, Niels Thykier wrote:
  For the record, there is a Lintian tag for this now[1], which suggests
  only a handful of packages violates this.
  
    - Recommend ASCII when possible.
    - Require ASCII for files in /bin, /sbin, /usr/bin, /usr/sbin and
      /usr/games.
  
  Requiring ASCII for files in $PATH should be trivial to implement as a
  separate tag.  I suppose the ASCII requirement could also be implemented
  as a pedantic check or so.  Regardless, patches welcome.  :)
 
 I disagree here: I'd want to remove any need for that recommendation
 instead.  You might have a point about files in $PATH, though.

Le Mon, Aug 12, 2013 at 03:56:48PM +, Thorsten Glaser a écrit :
 
 I disagree with requiring ASCII for $PATH though…

Hi Adam, Thorsten, and everybody,

To my knowledge, in Unstable there is currently no filename in the PATH that is
not encoded in plain ASCII.  The rationale for codifying this practice into a
requirement is to ensure that on multi-user systems, the administrator and the
users will not encounter commands that they cannot display or cannot type.

For file names outside the PATH, the recommendation to use ASCII when possible
should not be interpreted in an overly restrictive way: there are also good
reasons for using UTF-8 characters that are not in ASCII.

See http://bugs.debian.org/701081 for further discussion.

Have a nice day,

-- 
Charles Plessy
Tsurumi, Kanagawa, Japan


Archive: http://lists.debian.org/20130812235702.gb9...@falafel.plessy.net



Re: UTF-8 in jessie (debhelper and BOM)

2013-08-12 Thread Osamu Aoki
Hi,

UTF-8 is indeed a good goal in principle.

(I agree, but I am struggling to update package documentation, since Japanese
is known to be tough: JIS 2022/EUCJP/SHIFT-JIS/... are all in use, and mixed
EUC/SHIFT-JIS text can easily be confused with LATIN-1.)

But I do not understand goal #5.  Why MUST?  Do you have a rationale?

On Mon, Aug 12, 2013 at 03:50:19PM +0200, Florian Lohoff wrote:
 On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
  I propose the following sub-goals:
...
  4. all text files should be encoded in UTF-8

Yes.  But it would be nice to have some support from dh_installdocs :-)

 5. All programs consuming UTF8 Text must understand a BOM.
  

I agree with this as a SHOULD, but should we really state MUST?

After all, a BOM has no value in UTF-8 except to upset some programs.
See the Wikipedia page: http://en.wikipedia.org/wiki/Byte_order_mark

 | The Unicode Standard permits the BOM in UTF-8, but does not require
 | or recommend its use. Byte order has no meaning in UTF-8 ...
(pointer to the Unicode document is listed there.)

If it is only about the first bytes of a file, it is relatively easy.  But
there is text data with bogus BOMs in the middle of the content.  Should
programs understand those too, to be safe?

FYI: I recently had a problem with PO files containing lots of BOMs inside
the text, which broke XeTeX runs.  (Please note that the TeX family of
programs has more elaborate character support than Unicode-only UTF-8.
I would rather have XeTeX ...)  To me, a program to filter out such BOMs
would be nice.  But we should not blame a good UTF-8 program for stupid
BOM-containing UTF-8 data.
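
Such a filter is short to write.  A minimal Python sketch, with a made-up
function name and made-up sample data; it works on raw bytes, which is safe
because the three-byte sequence cannot occur inside another valid UTF-8
character.  Note the assumption that every U+FEFF in the data is unwanted:

```python
BOM_UTF8 = b"\xef\xbb\xbf"       # U+FEFF encoded as UTF-8


def strip_boms(data):
    """Remove every UTF-8 encoded U+FEFF; return (cleaned data, count removed)."""
    count = data.count(BOM_UTF8)
    return data.replace(BOM_UTF8, b""), count


po = BOM_UTF8 + b'msgid "a"\n' + BOM_UTF8 + b'msgstr "b"\n'
cleaned, removed = strip_boms(po)

print(removed)                    # 2
print(cleaned.decode("utf-8"))
```

Reporting the count makes it possible to warn about the input instead of
silently rewriting it.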

Osamu



Archive: http://lists.debian.org/20130813044403.GB19557@goofy.localdomain



UTF-8 in jessie

2013-08-11 Thread Adam Borowski
On Mon, May 06, 2013 at 02:49:57PM +0200, Andreas Beckmann wrote:
 now might be the right time to start a discussion about release goals
 for jessie.

I would like to propose full UTF-8 support.  I don't mean here full
support for all of Unicode's finer points, merely complete eradication of
mojibake.  That is, ensuring that /m.o/ matches möo, or that ä sorts
as equal to a + combining ¨, is out of scope of this proposal.

I propose the following sub-goals:

1. all programs should, in their default configuration, accept UTF-8 input
   and pass it through uncorrupted.  Having to manually specify encoding
   is acceptable only in a programmatic interface, GUI/std{in,out,err}/
   command line/plain files should work with nothing but LC_CTYPE.

2. all GUI/curses/etc programs should be able to display UTF-8 output where
   appropriate

3. all file names must be valid UTF-8

4. all text files should be encoded in UTF-8


This proposal doesn't call for eradication of non-UTF8 locales, even though
I think that's long overdue.  Josselin Mouette proposed that in #603914,
and I agree, but that's material for another flamewar.


Let's discuss the above points in depth: 

1. properly passing UTF-8

Text entered by a user should never get mangled.  These days, we can assume
mixed charsets are a thing of the past, thus there's no need for special
handling.  So are, mostly, programs that don't support UTF-8 -- but due to
historic reasons, some are not configured to do so.  Thus, let's mandate
that no per-program steps are needed.

An example: let's say we have an SQL table foo(a varchar(250)).  Let's run
somesqlclient -e "insert into foo values('$x'); select a from foo"
(-e being whatever stands for "execute this statement").

sqlite3: ok
p[ostgre]sql: ok
mysql: doesn't work!

But... the schema was declared as UTF-8, my locale is en_US.UTF-8, why
doesn't it work?  Turns out mysql requires you to call it with an extra
argument, --default-character-set=utf8.  There's no binary ABI to maintain,
compat with some historic behaviour makes no sense.  I can accept having to
specify the charset in, say, a DBI line, as that's what the API wants, but
on the command line... that's just wrong.  Am I supposed to wrap everything
with iconv, and suffer data loss on the way?  Setting LANG/LC_foo should
be enough.

Another case, perhaps more controversial, is apache.  Just take a look at
how many random Debian project pages have mangled encodings somewhere.
To a 0th approximation, well over one third (more for text/plain, such as
logs).  And that's with users whose skills are way above average.
These days, producing text that's not in UTF-8 can take quite a bit of
effort, especially with modern GUI tools which don't even really pay lip
service to supporting ancient charsets anymore.  Thus, if someone serves
some text in such a charset, he takes pains to even edit it.
One argument is that because AddDefaultCharset overrides http-equiv,
such old files would be mangled.  I'd say, as they already take effort
to maintain, let's let them rot in hell, as they are a rare case that
stands in the way of a nearly ubiquitous one working properly.  Such an
admin can always configure his server to use an ancient encoding if he
wishes to do so.
(The other argument, our own files shipped in /doc/, is dead since apache
2.2.22-4, and is a major part of part 4 of this proposal.)


2. GUI/curses display

With gtk, qt, and probably more, the issue is mostly moot.  Other toolkits
might require some work, but typically it's a matter of encoding (part 1 of
this proposal): characters have different horizontal widths so you use
outside functions for functionality like line wrapping already.

Not so much in curses.  Here, you have some characters take two spaces
(CJK), some take zero (zero width spaces), some take zero but must not be
detached from the previous character (combining).  The line wrapping
algorithm is actually quite simple, but needs to be implemented for every
curses program that displays arbitrary strings.  Ouch.
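
As a rough illustration of that algorithm, here is a Python sketch using
only the standard unicodedata module.  The width rules are a simplification
assumed for this example (a C program would normally ask wcwidth(3)), the
function names are made up, and it wraps per character rather than per word:

```python
import unicodedata


def char_width(ch):
    """Approximate number of terminal columns for one code point: 0, 1 or 2."""
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0                  # combining marks, zero width spaces
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                  # CJK and other wide characters
    return 1


def wrap(text, columns):
    """Split text into lines that fit in the given number of columns."""
    lines, current, used = [], "", 0
    for ch in text:
        width = char_width(ch)
        # Zero-width characters never start a new line, so they stay
        # attached to the character before them.
        if width and current and used + width > columns:
            lines.append(current)
            current, used = "", 0
        current += ch
        used += width
    if current:
        lines.append(current)
    return lines


print(wrap("日本では", 4))       # ['日本', 'では']
print(len(wrap("a\u0308bc", 2)))  # 2
```

The second call puts "a", the combining diaeresis and "b" on the first line,
since the combining mark takes no column of its own.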

[I got quite some experience fixing curses/etc programs this way, so I
pledge priority help here.  gtk/qt/fooxwidgets, not so much.]


3. all file names must be UTF-8

This is quite straightforward.  Packages shipping file names that are not
valid UTF-8 are already uninstallable on filesystems that operate in
characters rather than bytes.  Might be a good idea to forbid nasty stuff
like newlines, tabs, etc too.
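
A check for this is simple enough to sketch in a few lines of Python.  The
function name and the exact set of rejected control characters (everything
below U+0020, plus DEL) are assumptions of this example, not an existing
Lintian check:

```python
def filename_problems(name):
    """Return a list of problems found in a file name given as raw bytes."""
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return ["not valid UTF-8"]
    problems = []
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in text):
        problems.append("contains control characters")
    return problems


print(filename_problems(b"caf\xc3\xa9.txt"))   # []
print(filename_problems(b"caf\xe9.txt"))       # ['not valid UTF-8']
print(filename_problems(b"two\nlines"))        # ['contains control characters']
```

The name is taken as bytes on purpose, since that is what the filesystem
and the package archive actually store.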

I propose to apply this restriction to source packages as well.  If
Contents-* files are to be believed, the only violation is in a binary
package, with none in source packages, so there'd be no extra work now, and
at most a repack if an upstream regresses.  The benefit is less clear than
for binaries, but it's trivial and would prevent unexpected breakages.


4. all shipped text files in UTF-8

We don't want mojibake in provided documentation, config files, etc.  With
the number of hackers nearby, that includes even perl/shell/python/etc
scripts in /*/bin.  In short, all text files.

This could be done by a 

Re: UTF-8 in jessie

2013-08-11 Thread Chow Loong Jin
On Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski wrote:
 [...]
 
 On the other hand, detecting text files is hard.  The best tool so far,
 file, makes so many errors it's useless for this purpose.  One could use
 location: like, declaring stuff in /etc/ and /usr/share/doc/ to be text
 unless proven otherwise, but that's an incomplete hack.  Only hashbangs can
 be considered reliable, but scripts are not where most documentation goes.

Just a note: hashbangs can't really be considered reliable either -- consider
tarball-in-sh/other-script files (waf is a good example). Then there's stuff
like gambas-compiled executables which also ship with valid hashbangs, and
#!/usr/bin/haserl stuff which can contain lua bytecode after the hashbang line.

The only requirement for valid hashbangs, afaict, is that the first two bytes
are #!, and everything up to the \x20 or \n is resolvable to a valid filename.
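
To make the point concrete, a Python sketch with made-up helper names and a
made-up payload: the hashbang parses fine, yet the file as a whole is not
text.  The sketch does not check that the interpreter exists on disk.

```python
def hashbang_interpreter(data):
    """Return the interpreter path named on a #! line, or None."""
    if not data.startswith(b"#!"):
        return None
    first_line = data[2:].split(b"\n", 1)[0]
    fields = first_line.split()
    return fields[0].decode("utf-8", "replace") if fields else None


def is_utf8(data):
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A shell stub followed by an arbitrary binary payload, as in the
# tarball-in-sh case.
payload = b"#!/bin/sh\nexit 0\n" + bytes(range(256))

print(hashbang_interpreter(payload))  # /bin/sh
print(is_utf8(payload))               # False
```

So a hashbang can at best nominate a file as a candidate text file; the
content still has to be validated separately.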

 [...]


-- 
Kind regards,
Loong Jin




Re: UTF-8 in jessie

2013-08-11 Thread Charles Plessy
Le Mon, Aug 12, 2013 at 02:51:52AM +0200, Adam Borowski a écrit :
 
 I would like to propose full UTF-8 support.  I don't mean here full
 support for all of Unicode's finer points, merely complete eradication of
 mojibake.

Hi Adam,

this is a great goal.  Here are two comments.

There is a related issue opened on the Policy (http://bugs.debian.org/701081),
where we propose the following:

 - Require UTF-8 for the names of all files and directories installed by
   binary packages.
 - Recommend ASCII when possible.
 - Require ASCII for files in /bin, /sbin, /usr/bin, /usr/sbin and /usr/games.

About display by GUIs, I think that we should have a system to install all the
fonts necessary to display languages that we support at the installation.

Have a nice Debconf !

-- 
Charles


Archive: http://lists.debian.org/20130812021825.gc6...@falafel.plessy.net