Re: charsets in debian/control

2004-12-11 Thread Shot (Piotr Szotkowski)
Hello.

Paul Hampson:

 The email address isn't important, since
 that has to be a subset of ASCII anyway.

Are the Unicode-encoded domain names
supported in (modern) browsers only?

I can surf to http://.pl/ (with, e.g., Firefox) - can I send mail to
[EMAIL PROTECTED], or should I always use the [EMAIL PROTECTED] equivalent, as
the Unicode in domain names is restricted to WWW only?

Cheers,
-- Shot
-- 
There's a difference between random people with stripy jumpers, and a respected
scientist with a reputation. -- Steve Kitson, ucam.chat
 http://shot.pl/hovercraft/ ===




Re: charsets in debian/control

2004-12-11 Thread Marco d'Itri
On Dec 11, Shot (Piotr Szotkowski) [EMAIL PROTECTED] wrote:

 I can surf to http://?.pl/ (with, e.g., Firefox) - can I send mail to
 [EMAIL PROTECTED], or should I always use the [EMAIL PROTECTED] equivalent, as
 the Unicode in domain names is restricted to WWW only?
It depends on your MUA. With mutt you can send mail to internationalized
domain names without needing to type the ASCII encoding.

-- 
ciao, |
Marco | [9705 svftIaGWGM8aU]




Re: charsets in debian/control

2004-12-11 Thread Michal Politowski
On Sat, 11 Dec 2004 16:08:12 +0100, Shot (Piotr Szotkowski) wrote:
 Hello.
 
 Paul Hampson:
 
  The email address isn't important, since
  that has to be a subset of ASCII anyway.
 
 Are the Unicode-encoded domain names
 supported in (modern) browsers only?
 
 I can surf to http://.pl/ (with, e.g., Firefox) - can I send mail to
 [EMAIL PROTECTED], or should I always use the [EMAIL PROTECTED] equivalent, as
 the Unicode in domain names is restricted to WWW only?

Interesting question.
 Quick check.
  Not restricted.

Of course you have to use ACE (the ASCII Compatible Encoding defined in
RFC3490) in transit (SMTP commands and message headers) but MUAs may
(and at least some indeed do) accept/display the decoded form.
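
As a rough illustration of the ACE conversion described above, here is a minimal Python sketch. It relies on Python's built-in `idna` codec, which implements the RFC 3490 ToASCII and ToUnicode operations. The domain `bücher.example`, the address `shot@bücher.example`, and the function names are illustrative assumptions, not anything taken from this thread.

```python
def to_ace(domain: str) -> str:
    """Return the ASCII Compatible Encoding of an internationalized domain."""
    return domain.encode("idna").decode("ascii")


def from_ace(domain: str) -> str:
    """Return the decoded (Unicode) form of an ACE domain, for display."""
    return domain.encode("ascii").decode("idna")


def address_for_transit(address: str) -> str:
    """Encode only the domain part; the local part is left untouched."""
    local, _, domain = address.rpartition("@")
    return f"{local}@{to_ace(domain)}"


if __name__ == "__main__":
    print(to_ace("bücher.example"))
    print(from_ace("xn--bcher-kva.example"))
    print(address_for_transit("shot@bücher.example"))
```

An MUA along these lines would show the decoded form to the user and put only the ACE form on the wire.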

aptitude search '~D^libidn'
shows many apps at least linked with the IDN library.

-- 
Michał Politowski
Talking has been known to lead to communication if practised carelessly.




Re: charsets in debian/control

2004-12-11 Thread Paul Hampson
On Sat, Dec 11, 2004 at 04:08:12PM +0100, Shot (Piotr Szotkowski) wrote:
 Hello.

 Paul Hampson:

  The email address isn't important, since
  that has to be a subset of ASCII anyway.

 Are the Unicode-encoded domain names
 supported in (modern) browsers only?
 
 I can surf to http://.pl/ (with, e.g., Firefox) - can I send mail to
 [EMAIL PROTECTED], or should I always use the [EMAIL PROTECTED] equivalent, as
 the Unicode in domain names is restricted to WWW only?

Good point. Others have pointed out that you can. And the flipside is,
can I post to [EMAIL PROTECTED]

RFC2821 says:

Local-part = Dot-string / Quoted-string
Quoted-string = DQUOTE *qcontent DQUOTE

While the above definition for Local-part is relatively permissive, for
maximum interoperability, a host that expects to receive mail SHOULD
avoid defining mailboxes where the Local-part requires (or uses) the
Quoted-string form or where the Local-part is case-sensitive.  For any
purposes that require generating or comparing Local-parts (e.g., to
specific mailbox names), all quoted forms MUST be treated as equivalent
and the sending system SHOULD transmit the form that uses the minimum
quoting possible.

Systems MUST NOT define mailboxes in such a way as to require the use in
SMTP of non-ASCII characters (octets with the high order bit set to one)
or ASCII control characters (decimal value 0-31 and 127).  These
characters MUST NOT be used in MAIL or RCPT commands or other commands
that require mailbox names.
==
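
The character restriction in the second quoted paragraph can be sketched as a simple check. This is only an illustration of that one rule (no octets with the high bit set, no ASCII control characters 0-31 and 127); it does not validate the full RFC 2821 mailbox grammar, and the function name and sample addresses are hypothetical.

```python
def violates_rfc2821_charset(mailbox: str) -> bool:
    """True if the mailbox contains non-ASCII or ASCII control characters."""
    # Allowed code points are 32..126; 0-31 and 127 are control characters,
    # and anything above 127 would need the high-order bit in SMTP.
    return any(ord(ch) < 32 or ord(ch) > 126 for ch in mailbox)


if __name__ == "__main__":
    for candidate in ("user@example.org", "użytkownik@example.org"):
        print(candidate, violates_rfc2821_charset(candidate))
```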

RFC2821 doesn't give more detail than that about Quoted-string, so I
presume we would have to use something like the ACE encoding used
for domain names. A quick google didn't show up anything concrete, so
I have _no_ idea what  would look like as an email box on my mail
server. I certainly think RFC2047 would be a bad idea:
=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
(I have no idea what that says, I grabbed it from the RFC. It's base64
or something quite like it)
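
For the curious, the RFC 2047 encoded-word above can be decoded with the Python standard library. This is a small sketch using `email.header.decode_header`; the helper name `decode_encoded_word` is an assumption made for the example.

```python
from email.header import decode_header


def decode_encoded_word(word: str) -> str:
    """Decode an RFC 2047 encoded-word (or plain text) into a string."""
    pieces = []
    for data, charset in decode_header(word):
        if isinstance(data, bytes):
            # Encoded parts come back as bytes plus the declared charset.
            pieces.append(data.decode(charset or "ascii"))
        else:
            pieces.append(data)
    return "".join(pieces)


if __name__ == "__main__":
    sample = "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?="
    print(decode_encoded_word(sample))  # prints: If you can read this yo
```

The `B` in the encoded-word does indicate base64, and the sample is the first half of a sentence split across two encoded-words in the RFC.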

So the short answer is the email address in SMTP has to be a subset of
US-ASCII, but domain names can be handled by libidn and local-parts are
still in need of a standard.

-- 
---
Paul TBBle Hampson, MCSE
7th year CompSci/Asian Studies student, ANU
The Boss, Bubblesworth Pty Ltd (ABN: 51 095 284 361)
[EMAIL PROTECTED]

No survivors? Then where do the stories come from I wonder?
-- Capt. Jack Sparrow, Pirates of the Caribbean

This email is licensed to the recipient for non-commercial
use, duplication and distribution.
---




Re: charsets in debian/control

2004-12-08 Thread Steve Langasek
On Tue, Dec 07, 2004 at 05:56:54PM +, Thaddeus H. Black wrote:
  But yes, non-ASCII Latin-1 chars should not be given
  special status over the national chars found in other
  languages spoken by project members.  Debian should be
  using either ASCII, or Unicode; standardizing on
  Latin-1 makes no sense in a global project.

 True.  Look, Steve: mild abuse aside, I agree with you
 in every particular.  Nevertheless, I would respectfully
 suggest that your criticism underscores my point, which
 regards the monstrous increase in complexity which the
 full Unicode standard represents.

Yet you had concluded this means we should use Latin-1 as an encoding for
the files.  All arguments that justify the use of Latin-1 characters in the
control file are equally applicable to any of a number of other national
character sets used by one or more developers.

 Consider.  Is it a bug if Readline cannot echo full bidirectional input?

Er, yes, sure it is, independently of what happens in debian/control.

 If Dselect does not appreciate all the non-spacing
 characters?

IFF dselect has a reason to display such characters, yes.  This may well be
the case regardless of whether debian/control ever supports non-ASCII
characters; Debian may start supporting localized Packages files via some
external mechanism, or it may provide a localized UI that requires these
characters.

 If Less does not regard Tibetan subjoined letters?  (This is my Tibetan
 straw man.)

Yes, this is also a bug.  Not one that's likely to be noticed for a while,
but a bug nevertheless.  But your example again overstates the complexity of
the task: the main responsibility of less is to figure out how many
characters to display on a line, and let the *terminal* render the glyphs.
This is code that needs to be implemented only once, and most of the work is
already done centrally for *all* apps by glibc which keeps track of the
display width of each character.
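
To make the "how many columns does this string occupy" task concrete, here is a hedged Python sketch. The Python standard library does not expose glibc's `wcwidth()`, so this only approximates it with `unicodedata`: combining marks count as zero columns, East Asian wide and fullwidth characters as two, everything else as one. The function name is an assumption, and a real pager would use the C library tables rather than this approximation.

```python
import unicodedata


def approx_width(text: str) -> int:
    """Approximate the number of terminal columns needed to show text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width


if __name__ == "__main__":
    for sample in ("less", "日本語", "e\u0301", "\uff21"):
        print(repr(sample), approx_width(sample))
```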

 Undoubtedly one might observe that the Tibetan problem
 were not really a problem with Less but rather with some
 underlying library, but this misses the point---or
 rather again it underscores the point.  Unicode solves
 what for many of us was not a problem by creating an
 entirely new class of problems.  For example, it
 requires us to be particular about how we tag our e-mail
 attachments...

Um, no.  Being part of a *global Internet* causes this problem for you.
The non-ASCII characters in your email were undefined gibberish according
to your headers; only naive (or helpful, YMMV) mail readers would render
them at all, and only naive mail readers commanded by users using a Western
European locale would have rendered them as intended.  Actually, perhaps
even that is being too generous, as there are *different* native 8-bit
encodings used on each of Unix, Windows, and MacOS; the Unix and Windows
encodings differ on relatively few codepoints, but the Mac encoding is
widely different.
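
A small Python sketch of why untagged 8-bit text is ambiguous: the same byte is decoded under three legacy encodings. The codec names `latin-1`, `cp1252`, and `mac_roman` are Python's names for ISO-8859-1, Windows-1252, and Mac OS Roman; choosing them to stand in for "Unix, Windows, and MacOS" is an assumption made for illustration.

```python
def decode_all(raw: bytes, encodings=("latin-1", "cp1252", "mac_roman")) -> dict:
    """Decode the same bytes under several 8-bit encodings."""
    return {enc: raw.decode(enc) for enc in encodings}


if __name__ == "__main__":
    # latin-1 and cp1252 agree on 0xE9 but differ on 0x80;
    # mac_roman assigns a different character to 0xE9.
    for raw in (b"\xe9", b"\x80"):
        for enc, text in decode_all(raw).items():
            print(raw, enc, repr(text))
```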

And you think it's ok to inflict this same mess on anyone not using a
Latin-1 locale while trying to read a debian/control file?

 Am I arguing to jettison Unicode?  No; to the partial
 extent that I had been arguing it earlier in the thread,
 you, Peter, Daniel and Matthew have changed my mind.
 However, the typical roster of skills one masters in
 contributing broadly to Debian development is already
 awesome: C, C++, CPP, Make, Perl, Python, Autoconf, CVS,
 Shell, Glibc, System calls, /proc, IPC, sockets, Sed,
 Awk, Vi, Emacs, locales, Libdb, GnuPG, Readline,
 Ncurses, TeX, Postscript, Groff, XML, assembly, Flex,
 Bison, ORB, Lisp, Dpkg, PAM, Xlibs, Tk, GTK, SysVInit,
 Debconf, ELF, etc.---not to mention the use of the
 English language at a sophisticated technical level.
 UTF-8 is neat, but I do not really like Unicode (you may
 have noticed this).  Seeking essential simplicity, I
 would prefer to keep the full hairy overgrown Unicode
 standard from the typical Debian roster of development
 skills.  Wouldn't you?

1) Sorry, modern software is a complex creature.  This is because we demand
complex things of it -- including handling all the languages that we speak.

2) Most DDs do not master all of the above skills.  *I* don't have a mastery
of all of the above skills; contributing broadly to Debian usually means
mastering some of these skills, and knowing where to find answers for the
rest.

3) Mastering Unicode, for the purposes of almost anyone not working
directly on glibc or implementing a terminal, is roughly equivalent to
making sure your application implements proper string handling for CJK.
If you do it right, the differences between UTF-8 and ISO-2022 are normally
minimal; if you do it wrong, you get bug reports from Japanese users.
However, for files for which no encoding is specified, there is no right way
to handle non-ASCII data, which is why debian/control is an issue.

4) As suggested above, for 98% of all applications on the system, the
encoding used for debian/control is *entirely 

Re: charsets in debian/control

2004-12-08 Thread Thaddeus H. Black
It is one thing spiritedly to argue a point against
friends and allies.  It is another to be obstinate.  I
do not wish the latter, and I admit that I am both
outnumbered and outreasoned today.  Please permit me
without malice to conform my position, which now might
be stated as follows.

  Unicode is a reasonable solution to a difficult yet
  important problem.  Broadly accepted even among
  Debian Developers from the Latin-1 countries,
  Unicode is also recognized outside Debian around a
  wider world.  Unicode is recommended for general
  Debian application.

  For non-localized purposes in which a restricted,
  byte-based character set is wanted, plain seven-bit
  ASCII is normally the logical choice.  As for
  Latin-1, although it served some needs in an earlier
  day, it must today be regarded as a local,
  incompatible encoding, not recommended for general
  international use.

I trust that you will inform me if the conformed
position yet lacks in any significant way!  Besides
expressing my own revised view, the statement also means
to summarize the subthread's key points.

Since I happen to have the attention of interested
people at the moment, I should say that I could use some
help in conforming debram's [7800 Non-English Natural
Language] division sensibly to the Unicode consensus.  I
lack the right knowledge to do it myself.  At present,
only the Latin-1 languages are sensibly differentiated
there.  The aid of a Russian (for group 7890) and a
Japanese (for group 7880) might be particularly
suitable, for instance.  (If you don't know what this is
about, it regards debtags [1].)

Turning to another matter, the responses to my impromptu
roster of Debian development skills indicate that the
roster has been taken in slightly a different manner
than I had meant it.

 ... the typical roster of skills one masters in
 contributing broadly to Debian development is ...
 awesome: C, C++, CPP, Make, Perl, Python, Autoconf,
 CVS, Shell, Glibc, System calls, /proc, IPC, sockets,
 Sed, Awk, Vi, Emacs, locales, Libdb, GnuPG, Readline,
 Ncurses, TeX, Postscript, Groff, XML, assembly, Flex,
 Bison, ORB, Lisp, Dpkg, PAM, Xlibs, Tk, GTK, SysVInit,
 Debconf, ELF, etc.---not to mention the use of the
 English language at a sophisticated technical level.

Although the roster may be interesting, it was meant
neither as a canonical proposal nor as a challenge.  In
fact it was just what I had happened to think of
informally at the moment.  For the record, I happen to
have a working familiarity with nineteen of the items on
my own roster, plus a limited familiarity with seven
more.  Were the roster a challenge, it would be a
foolish one, because Steve Langasek would beat me in a
Debian development contest and I know it.  As for the
other fifteen roster items, as Steve said,

 contributing broadly to Debian usually means
 mastering some of these skills, and knowing where to
 find answers for the rest.

-- 
Thaddeus H. Black
508 Nellie's Cave Road
Blacksburg, Virginia 24060, USA
+1 540 961 0920, [EMAIL PROTECTED]

1. http://debtags.alioth.debian.org




Re: charsets in debian/control

2004-12-07 Thread Peter Samuelson

[Roger Leigh]
 I've been using Debian with UTF-8 only locales for over 12 months
 now.  I now consider it fine for general use, with respect to
 terminal and application support.  Unlike a couple of years ago, most
 things work perfectly.

Some apps like 'screen' do not just configure themselves for UTF-8
support based on LC_CTYPE, but have to be manually configured.
Presumably your goal would include fixing these apps.
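
As a sketch of what "configuring from LC_CTYPE" could look like, the following Python function picks the codeset out of the locale environment variables, using the usual precedence LC_ALL, then LC_CTYPE, then LANG. It is a simplified stand-in: a real program would normally call `setlocale(LC_CTYPE, "")` and `nl_langinfo(CODESET)` instead of parsing the variables itself, and the function name is hypothetical.

```python
from typing import Mapping, Optional


def charset_from_env(env: Mapping[str, str]) -> Optional[str]:
    """Return the codeset named by the locale variables, if any."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # empty values are treated as unset
            # Locale names look like language[_territory][.codeset][@modifier]
            locale_name = value.split("@", 1)[0]
            if "." in locale_name:
                return locale_name.split(".", 1)[1]
            return None  # e.g. "C" or "pl_PL": no explicit codeset
    return None


if __name__ == "__main__":
    import os
    print(charset_from_env(os.environ))
```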




Re: charsets in debian/control

2004-12-07 Thread Adrian 'Dagurashibanipal' von Bidder
On Tuesday 07 December 2004 00.19, Roger Leigh wrote:

 I think going to UTF-8 as the default locale charmap for all locales
 is a feasible goal for etch, as is recoding everything to UTF-8 (where
 it makes sense).

Yep.

My biggest problem right now is 'lpr sometextfile' to a postscript printer 
(I use cups).  I *believe* the problem is not necessarily with cups itself 
but with a2ps or whatever is used to generate the postscript output.

cheers
-- vbi

-- 
Hail Eris!




Re: charsets in debian/control

2004-12-07 Thread Andreas Barth
* Roger Leigh ([EMAIL PROTECTED]) [041207 00:40]:
 I think going to UTF-8 as the default locale charmap for all locales
 is a feasible goal for etch, as is recoding everything to UTF-8 (where
 it makes sense).

"Feasible goal" and "etch" are the magic words, I think: I agree on that,
but I don't want to claim that we are already there today.


Cheers,
Andi
-- 
   http://home.arcor.de/andreas-barth/
   PGP 1024/89FB5CE5  DC F1 85 6D A6 45 9C 0F  3B BE F1 D0 C5 D1 D9 0C




Re: charsets in debian/control

2004-12-07 Thread Maciej Dems
I look at the screen, and there is Roger Leigh writing to me:
 - No UTF-8 console keymaps
 - Some broken libraries e.g. GTK+ 1.2 [obsolete]
 - I can't paste UTF-8 into emacs (perhaps a problem in my .emacs)

- mc making mess with its frames

Maciek

-- 
M.Sc. Maciej Dems  [EMAIL PROTECTED]
-
C o m p u t e r   P h y s i c s   L a b o r a t o r y
Institute of Physics,Technical University of Lodz
ul. Wolczanska 219, 93-005 Lodz, Poland, +48426313649




Re: charsets in debian/control

2004-12-07 Thread Eugeniy Meshcheryakov
On 07.12.2004 at 13:33 +0100, Maciej Dems wrote:
 I look at the screen, and there is Roger Leigh writing to me:
  - No UTF-8 console keymaps
  - Some broken libraries e.g. GTK+ 1.2 [obsolete]
  - I can't paste UTF-8 into emacs (perhaps a problem in my .emacs)
 
 - mc making mess with its frames
 
Add dselect and aptitude here.

-- 
Eugeniy Meshcheryakov

Kyiv National Taras Shevchenko University
Information and Computing Centre
http://icc.univ.kiev.ua




Re: charsets in debian/control

2004-12-07 Thread Daniel Burrows
On Tuesday 07 December 2004 12:44 am, Peter Samuelson wrote:
  Defining the character set as utf-8 means that any non-unicode
  capable application is going to have issues, yes.

 Postulate an app that is ignorant of character sets - we'll call it
 aptitude.  Fixing it to make it accept utf-8 and spit out the correct
 encoding for its LC_CTYPE is no harder than fixing it to make it accept
 iso-8859-1 and spit out the correct encoding for its LC_CTYPE.

 And if the app already deals with charset conversions but assumes
 iso-8859-1 input, then it's trivial to fix it to assume utf-8 input.

  This is not true.

  iso-8859-1 is an 8-bit charset, while Unicode is a 32-bit [0] charset.  
Storing and manipulating iso-8859-1 strings requires no changes to internal 
datatypes (only conversions for input and output); storing and manipulating 
Unicode means you have to switch to a completely different set of 
string-handling functions for all internal operations.

  In C++ you might be able to partly finesse this by creating a replacement 
string class, but if our program (call it aptitude) is already using a 
complex replacement string class for some tasks, and this class assumes that 
characters are 8 bits wide, this might be a slightly non-trivial task, 
especially compared to handling iso-8859-1.  Hypothetically speaking. :-)

  On the other hand, once the program is using Unicode internally, taking 
iso-8859-1 as input and producing it as output should be no problem.
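
The "Unicode internally, legacy encodings at the edges" arrangement can be sketched in a few lines of Python. The function name `recode` and the choice to substitute `?` for characters the target encoding cannot represent are assumptions made for this example; a real tool might prefer transliteration or an error.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Decode at the input boundary, encode at the output boundary."""
    text = data.decode(source)  # internal representation: Unicode
    # Characters the target cannot represent become "?" instead of failing.
    return text.encode(target, errors="replace")


if __name__ == "__main__":
    print(recode(b"caf\xc3\xa9", "utf-8", "iso-8859-1"))
    print(recode(b"caf\xe9", "iso-8859-1", "utf-8"))
```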

  Daniel

  [0] According to the libc manual, only 16 bits have been assigned, but GNU 
systems use 32-bit encoding internally if the libc transcoding functions are 
used.

-- 
/--- Daniel Burrows [EMAIL PROTECTED] --\
|  swapon /dev/ram  |
\--- News without the $$ -- National Public Radio -- http://www.npr.org ---/




Re: charsets in debian/control

2004-12-07 Thread Daniel Burrows
On Tuesday 07 December 2004 10:17 am, Daniel Burrows wrote:
 complex replacement string class

  Admittedly, complex might (hypothetically) be a bit of an exaggeration.

  :P

  Daniel

-- 
/--- Daniel Burrows [EMAIL PROTECTED] --\
| You are in a maze of twisty little signatures, all alike. |
\ Evil Overlord, Inc: http://www.eviloverlord.com --/




Re: charsets in debian/control

2004-12-07 Thread Richard Atterer
On Tue, Dec 07, 2004 at 10:17:17AM -0500, Daniel Burrows wrote:
 On Tuesday 07 December 2004 12:44 am, Peter Samuelson wrote:
  And if the app already deals with charset conversions but assumes
  iso-8859-1 input, then it's trivial to fix it to assume utf-8 input.
 
   This is not true.
 
   iso-8859-1 is an 8-bit charset, while Unicode is a 32-bit [0] charset.  
 Storing and manipulating iso-8859-1 strings requires no changes to internal 
 datatypes (only conversions for input and output); storing and manipulating 
 Unicode means you have to switch to a completely different set of 
 string-handling functions for all internal operations.

No, you do not have to do this. You can keep working with char; the
changes when switching to UTF-8 will mostly have to deal with the fact that
one Unicode character is represented by more than one char. This means that
you need to use a different strlen function, take care only to chop strings
of char at character boundaries, ensure that input strings are actually
valid UTF-8, etc.
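
The three adjustments listed above (a different strlen, chopping only at character boundaries, validating input) can be sketched on raw byte strings. This is Python rather than C, the function names are hypothetical, and it assumes the input is already valid UTF-8 for the first two helpers. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def utf8_char_count(data: bytes) -> int:
    """Count characters by counting bytes that are not continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_truncate(data: bytes, max_bytes: int) -> bytes:
    """Cut to at most max_bytes without splitting a multibyte character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # If the first dropped byte is a continuation byte, the cut would land
    # inside a character, so back up to that character's lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


def is_valid_utf8(data: bytes) -> bool:
    """Check that the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    raw = "naïve".encode("utf-8")
    print(len(raw), utf8_char_count(raw), utf8_truncate(raw, 3))
```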

Cheers,

  Richard

-- 
  __   _
  |_) /|  Richard Atterer |  GnuPG key:
  | \/¯|  http://atterer.net  |  0x888354F7
  ¯ '` ¯




Re: charsets in debian/control

2004-12-07 Thread Matthew Garrett
Daniel Burrows [EMAIL PROTECTED] wrote:

   iso-8859-1 is an 8-bit charset, while Unicode is a 32-bit [0] charset.
 Storing and manipulating iso-8859-1 strings requires no changes to internal
 datatypes (only conversions for input and output); storing and manipulating
 Unicode means you have to switch to a completely different set of
 string-handling functions for all internal operations.

utf-8 is an 8-bit encoding of unicode, using variable length characters.
Traditional string manipulation routines work fine, except in the case
where you need to know the number of characters rather than the number
of bytes. This is typically not a large number of areas of code.
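
A brief Python sketch of this point: byte-oriented routines such as searching for an ASCII delimiter keep working on UTF-8 data, because every byte of a multibyte sequence has its high bit set and so can never be mistaken for an ASCII character. Only the length differs between bytes and characters. The field value `Jörg Example` and the function name are made up for the example.

```python
from typing import Tuple


def split_field(line: bytes) -> Tuple[bytes, bytes]:
    """Split a 'Name: value' control-file line without decoding it."""
    name, _, value = line.partition(b": ")
    return name, value


if __name__ == "__main__":
    line = "Maintainer: Jörg Example".encode("utf-8")
    name, value = split_field(line)
    text = value.decode("utf-8")
    # The byte length exceeds the character length by one per two-byte character.
    print(name, len(value), len(text))
```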

   [0] According to the libc manual, only 16 bits have been assigned, but GNU
 systems use 32-bit encoding internally if the libc transcoding functions are
 used.

The libc manual is out of date. We've been using more than 16 bits for a
while.
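
To illustrate a code point beyond 16 bits, here is a short Python sketch using U+1D11E (MUSICAL SYMBOL G CLEF), chosen only as a convenient example of an assigned character outside the first 65536 code points. The helper name is an assumption.

```python
CLEF = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, above U+FFFF


def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character needs in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    print(hex(ord(CLEF)), encoded_sizes(CLEF))
    print(hex(ord("A")), encoded_sizes("A"))
```

In UTF-16 such a character needs a surrogate pair (two 16-bit units), which is why a fixed 16-bit character type is not enough.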

-- 
Matthew Garrett | [EMAIL PROTECTED]




Re: charsets in debian/control

2004-12-07 Thread Daniel Burrows
On Tuesday 07 December 2004 10:40 am, Richard Atterer wrote:
 No, you do not have to do this. You can keep working with char; the
 changes when switching to UTF-8 will mostly have to deal with the fact that
 one Unicode character is represented by more than one char. This means that
 you need to use a different strlen function, take care only to chop strings
 of char at character boundaries, ensure that input strings are actually
 valid UTF-8, etc.

  This might work for programs that relatively blindly manipulate character 
strings and can pass them off to the terminal for processing.  In fact, 
aptitude does a *lot* of processing and formatting of strings internally.  
That means, for instance: splitting strings into words and paragraphs, 
truncating strings, finding out how wide strings are.

  More importantly, it also makes significant (and increasing) use of strings 
annotated with the terminal attributes of each character (think colors, 
bold/reverse video, etc).  Needless to say, it performs all of the above 
operations on those strings as well.

  All of these are impacted by extended charsets: for instance, you need to 
use a different function to find whitespace, and combining characters with 
their attributes requires the use of a structure where an integer previously 
sufficed.  That's not to mention finding the length of a string, which is 
necessary to perform most types of layout.

The changes that are necessary are at least:

  At a minimum, the class used for formatted strings will have to be 
re-targeted to support either formatted wide strings or formatted utf8 
strings.  If wide characters are not used internally, it is also necessary 
to audit every occurrence of s.size() and check whether the length-in-memory 
or the length-in-characters of the string is being queried.  If neither wide 
characters nor a utf8-specialized basic_string are used, it is necessary to 
audit every string constructor (which might cut a substring) and make sure 
that it doesn't play havoc with utf8 codings.  Every use of isspace() and 
friends will have to be replaced with Unicode-aware equivalents.
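
The isspace() point can be shown with a small Python comparison between byte-level and character-level whitespace handling. The sample text uses U+3000 (IDEOGRAPHIC SPACE) as a separator; the function names are hypothetical, and this is only an analogy for the C/C++ situation in aptitude, not its actual code.

```python
def split_words_bytes(data: bytes) -> list:
    """Byte-level split: only ASCII whitespace is recognized."""
    return data.split()


def split_words_unicode(text: str) -> list:
    """Character-level split: any Unicode whitespace is recognized."""
    return text.split()


if __name__ == "__main__":
    text = "日本語\u3000テキスト"
    print(split_words_unicode(text))
    print(len(split_words_bytes(text.encode("utf-8"))))
```

The byte-level version never sees the ideographic space because none of its three UTF-8 bytes is an ASCII whitespace byte, so word wrapping built on it would treat the whole string as one word.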

  And that's just the problems I can think of off the top of my head.

  It's also necessary to use a completely different set of terminal i/o 
routines, but this is pretty much expected.

  None of these problems are insurmountable, of course, and I know pretty much 
how to solve most of them.  However, it's also true that none of them exist 
*at all* when using iso-8859-1, which is why I object to the comment that 
it's no harder to handle utf8 than iso-8859-1.  (in fact, if your terminal 
speaks iso-8859-1, aptitude will handle it just fine without any changes)

  Daniel

-- 
/--- Daniel Burrows [EMAIL PROTECTED] --\
|  Hi, I'm a .signature virus!  |
|  Copy me into your .signature to help me spread!  |
\ The Turtle Moves! -- http://www.lspace.org ---/




Re: charsets in debian/control

2004-12-07 Thread Thaddeus H. Black
Steve Langasek writes,

 ... most of the letters you listed here are specific
 to the IPA, which would have no use at all in a
 control file as they're not part of the writing system
 of any natural language.

Ok.

 Encodings and charsets are distinct concepts.  Just
 because the file is specified in UTF-8 *encoding* does
 not mean we suddenly have to start coping with the
 entire Unicode character set.

Right.

 Why, what a lovely straw man you have there.

No comment.

 But yes, non-ASCII Latin-1 chars should not be given
 special status over the national chars found in other
 languages spoken by project members.  Debian should be
 using either ASCII, or Unicode; standardizing on
 Latin-1 makes no sense in a global project.

True.  Look, Steve: mild abuse aside, I agree with you
in every particular.  Nevertheless, I would respectfully
suggest that your criticism underscores my point, which
regards the monstrous increase in complexity which the
full Unicode standard represents.  Consider.  Is it a
bug if Readline cannot echo full bidirectional input?
If Dselect does not appreciate all the non-spacing
characters?  If Less does not regard Tibetan subjoined
letters?  (This is my Tibetan straw man.)

Undoubtedly one might observe that the Tibetan problem
were not really a problem with Less but rather with some
underlying library, but this misses the point---or
rather again it underscores the point.  Unicode solves
what for many of us was not a problem by creating an
entirely new class of problems.  For example, it
requires us to be particular about how we tag our e-mail
attachments...

 ... to properly declare the character set on the
 non-ASCII mails you send.

We can perhaps be pardoned for feeling a little grumpy
about this.

Am I arguing to jettison Unicode?  No; to the partial
extent that I had been arguing it earlier in the thread,
you, Peter, Daniel and Matthew have changed my mind.
However, the typical roster of skills one masters in
contributing broadly to Debian development is already
awesome: C, C++, CPP, Make, Perl, Python, Autoconf, CVS,
Shell, Glibc, System calls, /proc, IPC, sockets, Sed,
Awk, Vi, Emacs, locales, Libdb, GnuPG, Readline,
Ncurses, TeX, Postscript, Groff, XML, assembly, Flex,
Bison, ORB, Lisp, Dpkg, PAM, Xlibs, Tk, GTK, SysVInit,
Debconf, ELF, etc.---not to mention the use of the
English language at a sophisticated technical level.
UTF-8 is neat, but I do not really like Unicode (you may
have noticed this).  Seeking essential simplicity, I
would prefer to keep the full hairy overgrown Unicode
standard from the typical Debian roster of development
skills.  Wouldn't you?

-- 
Thaddeus H. Black
508 Nellie's Cave Road
Blacksburg, Virginia 24060, USA
+1 540 961 0920, [EMAIL PROTECTED]




Re: charsets in debian/control

2004-12-07 Thread Marco d'Itri
On Dec 07, Thaddeus H. Black [EMAIL PROTECTED] wrote:

 UTF-8 is neat, but I do not really like Unicode (you may
Actually you do not even understand it, because this sentence is
meaningless.

-- 
ciao, |
Marco | [9639 coubl1Ib61SmA]




Re: charsets in debian/control

2004-12-07 Thread Petter Reinholdtsen
[Thaddeus H. Black]
 UTF-8 is neat, but I do not really like Unicode (you may

[Marco d'Itri]
 Actually you do not even understand it, because this sentence is
 meaningless.

Perhaps he is aware of the difference between Unicode and ISO-10646?
UTF-8 is an encoding of ISO-10646.




RE: charsets in debian/control

2004-12-07 Thread Julian Mehnle
Thaddeus H. Black wrote:
 However, the typical roster of skills one masters in contributing
 broadly to Debian development is already awesome: C, C++, CPP, Make,
 Perl, Python, Autoconf, CVS, Shell, Glibc, System calls, /proc, IPC,
 sockets, Sed, Awk, Vi, Emacs, locales, Libdb, GnuPG, Readline, Ncurses,
 TeX, Postscript, Groff, XML, assembly, Flex, Bison, ORB, Lisp, Dpkg,
 PAM, Xlibs, Tk, GTK, SysVInit, Debconf, ELF, etc.---not to mention the
 use of the English language at a sophisticated technical level.

Pardon me, but I know only 18 of the 40 items you mentioned, and I don't
have a problem writing software for Debian or Linux in general.

(Some) developers having to learn (parts of) Unicode is not a _general_
problem, not least because many already know it.  It might be a
problem for _you_in_particular_, because you do not know it and don't want
to learn it.

But that isn't a very good argument against applying a perhaps somewhat
complex technology to Debian that's well suited for the job.  Especially
since many tools that can't yet handle multibyte encodings
(UTF-8/Unicode in particular) will _have_to_ support them at some time
in the future anyway.  BTW, the understanding of Unicode isn't required
for most tools, mostly the understanding of UTF-8 is sufficient, and UTF-8
is trivial.

 UTF-8 is neat, but I do not really like Unicode (you may have noticed
 this).

You might like Bytext[1] better then.  SCNR ;-)

Seriously, I get the impression you don't like Unicode because _you_ don't
need it.

 Seeking essential simplicity, I would prefer to keep the full hairy
 overgrown Unicode standard from the typical Debian roster of development
 skills.  Wouldn't you?

No, I wouldn't.

References:
 1. http://www.bytext.org




Re: charsets in debian/control

2004-12-06 Thread Adrian 'Dagurashibanipal' von Bidder
On Sunday 05 December 2004 20.11, Goswin von Brederlow wrote:

 Any parser that accepts 8bit non-ascii chars
 will accept UTF-8 then. What remains is just making the UTF-8 chars
 visually correct then.

And make sure that, where character strings are modified, the multibyte 
sequences are counted right and handled correctly.

This should not be a big problem, but things like display code etc. must now 
be aware that character count == byte count no longer holds.

-- vbi


-- 
Immer ist der Mann ein junger Mann, der einem jungen Weibe wohl
gefällt.
  -- Johann Wolfgang von Goethe (Nausikaa)




Re: charsets in debian/control

2004-12-06 Thread Goswin von Brederlow
Daniel Burrows [EMAIL PROTECTED] writes:

 On Sunday 05 December 2004 03:32 pm, Jose Carlos Garcia Sogo wrote:
  Would Peter permit me a mild dissent?  I prefer Latin-1.  Reason: I can
  recognize and distinguish Latin-1 characters, even when I do not always
  understand the words they spell.  Recognizing and distinguishing the
  characters is important to me.  And not just to me.  Imagine the dismay
  of a Korean user trying to read Arabic script in a control file.

  But the only field in UTF8 should be Maintainer, and that field should
 have (IMHO) also a Roman transliteration of the name, if you don't use a
 latin charset (Greek, Arabic, Japanese, Chinese...)

   Well, when aptitude gets UTF8 support, it'll decode all the control fields 
 that are mainly meant for human consumption: that means at least Description 
 in addition to the Maintainer field, and maybe also Section.

I think the only field in UTF-8 in the main (English) Packages file
should be the Maintainer field. There might be some discussion about
allowing the package's name in the description to be native too but I
wouldn't like that.

Now, for translated Packages files, like a Chinese one, only the
description should change.

   I don't see any reason to limit ourselves in the long term by sticking to 
 Latin1 (or ASCII) just because none of us can read all of the languages that 
 are available in the extended UTF8 namespace.  If we want people to stick to 
 certain subsets of UTF8, that should be determined in Policy, not the 
 packaging software.

The software has to be able to work with translated Packages files. It
would be quite unacceptable for aptitude to show gibberish in the
description for a Chinese user with a translated Packages file. So
there really should be no limit there.

But limiting each Packages file to the subset of characters
recognisable in that language sounds like a good idea. Chinese users
probably don't want Japanese in their Packages file and vice versa.

Seeing that English is the common language in Debian I would also say
that an English description is a must.

   If you want a practical concern (aside from, say, a general suspicion of 
 building policy into software tools), consider these cases:

   - Someone wants to translate the Description fields of all packages in 
 Debian into Chinese or Arabic.  What will they do if the package tools only 
 support Latin-1?

   - Someone wants to use the Debian packaging tools to create a new 
 distribution for use in China.  Again, what will they do if the package tools 
 only support Latin-1?

   Daniel

You are absolutely right, the tools should cope with everything with
the possible exception of warning/rejecting policy violations on
upload.

MfG
Goswin




Re: charsets in debian/control

2004-12-06 Thread Thaddeus H. Black
I would not disagree with Peter or Daniel.  They are
right in my view.  However, consider the following
Unicode characters:

  025A LATIN SMALL LETTER SCHWA WITH HOOK
  025E LATIN SMALL LETTER CLOSED REVERSED OPEN E
  0261 LATIN SMALL LETTER SCRIPT G
  0264 LATIN SMALL LETTER RAMS HORN
  0267 LATIN SMALL LETTER HENG WITH HOOK
  027A LATIN SMALL LETTER TURNED R WITH LONG LEG
  027F LATIN SMALL LETTER REVERSED R WITH FISHHOOK
  0285 LATIN SMALL LETTER SQUAT REVERSED ESH
  0295 LATIN LETTER PHARYNGEAL VOICED FRICATIVE
  02A2 LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE
  FF21 FULLWIDTH LATIN CAPITAL LETTER A

We are not speaking of a stricken Polish L, a
double-accented Magyar O, or a euro sign.  We are
speaking of... well, to tell the truth I have no idea
what these letters are.  Have you?  More to the point,
should you and I learn to recognize such letters?
Should we expect basic Latin terminal fonts to cover
them?  Is it reasonable to marginalize the á's and ü's
of Latin-1 by lumping them with the squat reversed
esh?

Now, the squat reversed esh as such does not bother
me.  If you show me a picture of it and tell me what
language it is for and what sound it makes, then I will
know it.  What is important to me is to preserve the
simple Roman conception of the general-use alphabet in a
reasonable way---not for communication in a particular
language, but rather for clear, compact terminal
representation and for general international use.
Inherent in the concept are the relative fewness of the
available characters and the predictable way they are
arrayed across a page from left to right.

In my view, a terminal which cannot correctly display
the á is somewhat broken, and a user who does not
recognize the á probably should learn.  I would not
say the same with respect to the squat reversed esh.
However, this is just my view.

-- 
Thaddeus H. Black
508 Nellie's Cave Road
Blacksburg, Virginia 24060, USA
+1 540 961 0920, [EMAIL PROTECTED]


pgpC3wA9A3ASF.pgp
Description: PGP signature


Re: charsets in debian/control

2004-12-06 Thread Bruce Perens
Thaddeus H. Black wrote:
 025A LATIN SMALL LETTER SCHWA WITH HOOK
 025E LATIN SMALL LETTER CLOSED REVERSED OPEN E
 0261 LATIN SMALL LETTER SCRIPT G
 0264 LATIN SMALL LETTER RAMS HORN
 0267 LATIN SMALL LETTER HENG WITH HOOK
 027A LATIN SMALL LETTER TURNED R WITH LONG LEG
 027F LATIN SMALL LETTER REVERSED R WITH FISHHOOK
 0285 LATIN SMALL LETTER SQUAT REVERSED ESH
 0295 LATIN LETTER PHARYNGEAL VOICED FRICATIVE
 02A2 LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE
 FF21 FULLWIDTH LATIN CAPITAL LETTER A
 


I have no idea what these letters are.
The Recording Artist Formerly Known as Prince :-)
   Bruce



Re: charsets in debian/control

2004-12-06 Thread Matthew Garrett
Thaddeus H. Black [EMAIL PROTECTED] wrote:

 We are not speaking of a stricken Polish L, a
 double-accented Magyar O, or a euro sign.  We are
 speaking of... well, to tell the truth I have no idea
 what these letters are.  Have you?  More to the point,
 should you and I learn to recognize such letters?
 Should we expect basic Latin terminal fonts to cover
 them?  Is it reasonable to marginalize the á's and ü's
 of Latin-1 by lumping them with the squat reversed
 esh?

Why is it important that you recognise them? I can't see any reasonable
argument against UTF-8 that doesn't also remove anything other than
ascii.

 In my view, a terminal which cannot correctly display
 the á is somewhat broken, and a user who does not
 recognize the á probably should learn.  I would not
 say the same with respect to the squat reversed esh.
 However, this is just my view.

Defining the character set as utf-8 means that any non-unicode capable
application is going to have issues, yes. But so does defining the
character set as anything other than ascii - people using a non-8859-1
terminal encoding won't be able to read any of the non-ascii characters
in the file.

The only two character sets that make any sense whatsoever in the Unix
world are ascii and UTF-8. I'd be happy with either, but I've got a
fairly anglo-centric viewpoint. I can see a strong argument for
maintainers actually being allowed to spell their name properly, even if
pragmatism suggests that we want a latinised version available as well.

-- 
Matthew Garrett | [EMAIL PROTECTED]




Re: charsets in debian/control

2004-12-06 Thread Roger Leigh
Andreas Barth [EMAIL PROTECTED] writes:

 Though I agree on your last statement (and please, remember, I'm from
 germany where non-ASCII-characters are also in common use), I still
 consider that UTF-8-not-ASCII has not finally reached ok, but it's on
 the way to it (and I consider this a good thing).

I've been using Debian with UTF-8 only locales for over 12 months now.
I now consider it fine for general use, with respect to terminal and
application support.  Unlike a couple of years ago, most things work
perfectly.  The only things I've currently found lacking are

- No UTF-8 console keymaps
- Some broken libraries e.g. GTK+ 1.2 [obsolete]
- I can't paste UTF-8 into emacs (perhaps a problem in my .emacs)

I think going to UTF-8 as the default locale charmap for all locales
is a feasible goal for etch, as is recoding everything to UTF-8 (where
it makes sense).


Regards,
Roger

-- 
Roger Leigh
Printing on GNU/Linux?  http://gimp-print.sourceforge.net/
Debian GNU/Linux        http://www.debian.org/
GPG Public Key: 0x25BFB848.  Please sign and encrypt your mail.




Re: charsets in debian/control

2004-12-06 Thread Steve Langasek
On Mon, Dec 06, 2004 at 06:58:10PM +, Thaddeus H. Black wrote:
 I would not disagree with Peter or Daniel.  They are
 right in my view.  However, consider the following
 Unicode characters:

   025A LATIN SMALL LETTER SCHWA WITH HOOK
   025E LATIN SMALL LETTER CLOSED REVERSED OPEN E
   0261 LATIN SMALL LETTER SCRIPT G
   0264 LATIN SMALL LETTER RAMS HORN
   0267 LATIN SMALL LETTER HENG WITH HOOK
   027A LATIN SMALL LETTER TURNED R WITH LONG LEG
   027F LATIN SMALL LETTER REVERSED R WITH FISHHOOK
   0285 LATIN SMALL LETTER SQUAT REVERSED ESH
   0295 LATIN LETTER PHARYNGEAL VOICED FRICATIVE
   02A2 LATIN LETTER REVERSED GLOTTAL STOP WITH STROKE
   FF21 FULLWIDTH LATIN CAPITAL LETTER A

 We are not speaking of a stricken Polish L, a
 double-accented Magyar O, or a euro sign.

Indeed we're not; most of the letters you listed here are specific to the
IPA, which would have no use at all in a control file as they're not part of
the writing system of any natural language.
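
The claim above can be checked roughly with a short Python sketch. It assumes
that "specific to the IPA" can be approximated by membership in the Unicode
"IPA Extensions" block (U+0250..U+02AF); that range test and the helper name
are illustrative choices, not something taken from the mail.

```python
import unicodedata

# Code points listed in the quoted message.
LISTED = [0x025A, 0x025E, 0x0261, 0x0264, 0x0267, 0x027A,
          0x027F, 0x0285, 0x0295, 0x02A2, 0xFF21]

# Assumption: the "IPA Extensions" block spans U+0250..U+02AF.
IPA_EXTENSIONS = range(0x0250, 0x02B0)


def in_ipa_extensions(cp: int) -> bool:
    """Return True if the code point lies in the IPA Extensions block."""
    return cp in IPA_EXTENSIONS


ipa = [cp for cp in LISTED if in_ipa_extensions(cp)]
other = [cp for cp in LISTED if not in_ipa_extensions(cp)]

for cp in LISTED:
    tag = "IPA Extensions" if in_ipa_extensions(cp) else "other block"
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))} ({tag})")
```

Under that assumption, ten of the eleven listed characters fall in the IPA
block; only U+FF21 (a fullwidth form) does not.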

Encodings and charsets are distinct concepts.  Just because the file is
specified in UTF-8 *encoding* does not mean we suddenly have to start coping
with the entire Unicode character set.  OTOH, the Unicode charset is also
the only one we have that is a superset of iso8859-1, iso8859-2, and
iso8859-15, so if you want to be able to *use* the , the , and the  in
the same file together with  and , the only sensible way to do so is to
specify a UTF-8 encoding.
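
A small Python sketch of the superset argument follows. The sample characters
are illustrative stand-ins (the ones in the original mail were lost in the
archive), and `representable` is a hypothetical helper name.

```python
# Illustrative sample: a-acute, l-stroke, o-double-acute, euro sign.
SAMPLE = "\u00e1 \u0142 \u0151 \u20ac"

CODECS = ("iso-8859-1", "iso-8859-2", "iso-8859-15", "utf-8")


def representable(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


results = {codec: representable(SAMPLE, codec) for codec in CODECS}
for codec, ok in results.items():
    print(f"{codec:12} {'ok' if ok else 'cannot represent the sample'}")
```

Each 8-bit charset covers part of the sample, but only the Unicode-based
encoding covers all of it in one file.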

 We are speaking of... well, to tell the truth I have no idea what these
 letters are.  Have you?  More to the point, should you and I learn to
 recognize such letters?  Should we expect basic Latin terminal fonts to
 cover them?  Is it reasonable to marginalize the á's and ü's of Latin-1
 by lumping them with the squat reversed esh?

Why, what a lovely straw man you have there.

But yes, non-ASCII Latin-1 chars should not be given special status over
the national chars found in other languages spoken by project members.
Debian should be using either ASCII, or Unicode; standardizing on Latin-1
makes no sense in a global project.

 In my view, a terminal which cannot correctly display
 the á is somewhat broken, and a user who does not
 recognize the á probably should learn.  I would not
 say the same with respect to the squat reversed esh.
 However, this is just my view.

Mmm-hmm.

 Content-Type: text/plain; charset=unknown-8bit

Your opinion about which charset to use for Debian files would carry more
weight with me if you had enough experience with such things to properly
declare the character set on the non-ASCII mails you send.

-- 
Steve Langasek
postmodern programmer


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-06 Thread Mike Hommey
On Mon, Dec 06, 2004 at 06:53:42PM -0800, Steve Langasek [EMAIL PROTECTED] 
wrote:
 But yes, non-ASCII Latin-1 chars should not be given special status over
 the national chars found in other languages spoken by project members.
 Debian should be using either ASCII, or Unicode; standardizing on Latin-1
 makes no sense in a global project.

Actually Latin-9 would be better, it doesn't contain the useless ¤.

Mike




Re: charsets in debian/control

2004-12-06 Thread Steve Langasek
On Tue, Dec 07, 2004 at 12:04:56PM +0900, Mike Hommey wrote:
 On Mon, Dec 06, 2004 at 06:53:42PM -0800, Steve Langasek [EMAIL PROTECTED] 
 wrote:
  But yes, non-ASCII Latin-1 chars should not be given special status over
  the national chars found in other languages spoken by project members.
  Debian should be using either ASCII, or Unicode; standardizing on Latin-1
  makes no sense in a global project.

 Actually Latin-9 would be better, it doesn't contain the useless ¤.

Standardizing on any 8-bit character set makes no sense in a global project.
:P

-- 
Steve Langasek
postmodern programmer


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-06 Thread Mike Hommey
On Mon, Dec 06, 2004 at 07:10:21PM -0800, Steve Langasek [EMAIL PROTECTED] 
wrote:
 On Tue, Dec 07, 2004 at 12:04:56PM +0900, Mike Hommey wrote:
  On Mon, Dec 06, 2004 at 06:53:42PM -0800, Steve Langasek [EMAIL 
  PROTECTED] wrote:
   But yes, non-ASCII Latin-1 chars should not be given special status over
   the national chars found in other languages spoken by project members.
   Debian should be using either ASCII, or Unicode; standardizing on Latin-1
   makes no sense in a global project.
 
  Actually Latin-9 would be better, it doesn't contain the useless ¤.

Re: charsets in debian/control

2004-12-06 Thread Peter Samuelson

[Matthew Garrett]
 Defining the character set as utf-8 means that any non-unicode
 capable application is going to have issues, yes.

Postulate an app that is ignorant of character sets - we'll call it
aptitude.  Fixing it to make it accept utf-8 and spit out the correct
encoding for its LC_CTYPE is no harder than fixing it to make it accept
iso-8859-1 and spit out the correct encoding for its LC_CTYPE.

And if the app already deals with charset conversions but assumes
iso-8859-1 input, then it's trivial to fix it to assume utf-8 input.
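
As a rough illustration of that point, here is a minimal Python sketch of a
recoding step. It is not aptitude's code; the function name and its
parameters are assumptions made for the example. The assumed input charset is
a single default argument, so switching it from iso-8859-1 to utf-8 is a
one-word change.

```python
import locale
from typing import Optional


def recode(raw: bytes, source: str = "utf-8",
           target: Optional[str] = None) -> bytes:
    """Decode raw control-file bytes and re-encode them for the terminal.

    target defaults to the charset of the current locale (LC_CTYPE).
    Characters the target cannot represent are replaced with '?'.
    """
    if target is None:
        target = locale.getpreferredencoding(False)
    text = raw.decode(source, errors="replace")
    return text.encode(target, errors="replace")


# UTF-8 input shown on a Latin-1 terminal.
print(recode("Jos\u00e9".encode("utf-8"), target="iso-8859-1"))
# A character the target lacks degrades to '?' instead of raising.
print(recode("\u0141".encode("utf-8"), target="ascii"))
```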

Peter


signature.asc
Description: Digital signature


charsets in debian/control

2004-12-05 Thread Peter Samuelson

We seem to be moving to a de facto standard of UTF-8 for non-ASCII
characters in debian/control files.  This is not specified in Policy
[1], but for hopefully obvious reasons, consistency is a Good Thing,
and UTF-8 seems to be the best solution for this sort of thing.

In my sid control files, I see 841 lines with non-ASCII characters,
mostly (761 lines) in Maintainer and Uploaders fields:

  perl -ne 'print if m/[\x80-\xff]/' /var/lib/apt/lists/* | wc -l

Of these, 747 lines are UTF-8 and 94 lines are not.[2]

I hate to suggest a mass bug filing (33 source packages), since it's a
mere de facto standard.  And I'm certainly not in the mood to campaign
for a Policy amendment.  But it would be a Good Thing to aim for
consistency here.  Current UI tools (dpkg, dselect, apt-cache,
aptitude) seem to know nothing about character sets, and just pass
characters verbatim to the terminal, but one can easily imagine a tool
that would convert to a user's local character set when possible.

I suggest that the affected source packages[3] be run through the
command 'iconv -f ORIGINAL_CHARSET -t utf-8' as soon as convenient.
Would people support a mass bug at minor severity?
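
For readers without iconv at hand, the suggested conversion can be sketched
in Python. This is only an approximation of what the iconv command does, and
the sample name is made up. It also shows why the original charset must be
known: converting data that is already UTF-8 as if it were Latin-1 silently
double-encodes it.

```python
def to_utf8(raw: bytes, original_charset: str) -> bytes:
    """Rough equivalent of `iconv -f ORIGINAL_CHARSET -t utf-8`."""
    return raw.decode(original_charset).encode("utf-8")


latin1_name = "Jos\u00e9".encode("iso-8859-1")   # hypothetical Maintainer name
once = to_utf8(latin1_name, "iso-8859-1")        # correct conversion
twice = to_utf8(once, "iso-8859-1")              # wrong: input was already UTF-8

print(once)    # b'Jos\xc3\xa9'
print(twice)   # b'Jos\xc3\x83\xc2\xa9'
```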

Peter

[1] Note that UTF-8 *is* recommended for debian/changelog.

http://www.debian.org/doc/debian-policy/ap-pkg-sourcepkg.html#s-pkg-dpkgchangelog

[2] It is easy to tell if text is UTF-8 or not; I use the exit status
of iconv -f utf-8 -t utf-8.  This gives very few false positives,
because UTF-8 has a very strict format.
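
The same test can be sketched in Python, using a strict UTF-8 decode in place
of the iconv exit status. The function names and the sample lines are
illustrative, not part of the original count.

```python
def classify(line: bytes) -> str:
    """Classify a line as 'ascii', 'utf-8', or 'other' (not valid UTF-8)."""
    if all(b < 0x80 for b in line):
        return "ascii"
    try:
        line.decode("utf-8")      # strict: raises on any malformed sequence
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


def tally(lines):
    """Count how many lines fall into each class."""
    counts = {"ascii": 0, "utf-8": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts


sample = [
    b"Package: example",
    "Maintainer: Jos\u00e9".encode("utf-8"),
    "Maintainer: Jos\u00e9".encode("iso-8859-1"),
]
print(tally(sample))
```

A false positive remains possible when bytes from another charset happen to
form a well-formed UTF-8 sequence, which is rare in real text.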

[3] abcm2ps                     freecraft                  maint-guide
    ap-utils                    gl-117                     movixmaker-2
    appunti-informatica-libera  glade-perl                 mozilla-locale-hu
    ayuda                       gnustep-icons              myspell-sv
    boa                         gridlock                   ntfsdoc
    boa-constructor             gtkdiskfree                pdftohtml
    bombermaze                  gtodo                      pdp
    bonsai                      iris                       pyca
    cadubi                      itcl3                      pyro
    cantus                      kernel-patch-2.4.26-s390   pythoncad
    coq-doc                     kernel-patch-2.4.27-s390   rat
    crafted                     krb4                       strategoxt
    darkstat                    lg-issue46                 sympa
    ddclient                    libcgi-validate-perl       syslog-ng
    doc-linux-html-pt           libconfig-general-perl     tuxeyes
    doc-linux-text-pt           libexporter-lite-perl      unac
    drpython                    libtext-unaccent-perl      wmblob
    elmo                        libuniversal-exports-perl  wmnetmon
    fcmp                        linux-ntfs                 wordtrans
    fortunes-fr                 linux-tutorial-es          wprint
    fortunes-it


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-05 Thread Peter Samuelson

[Peter Samuelson]
 I suggest that the affected source packages[3] be run through the
 command 'iconv -f ORIGINAL_CHARSET -t utf-8' as soon as convenient.

Ehhh, I see I have already ruined my credibility by pasting the wrong
source package list.  The real list is much shorter.

Apologies,
Peter

ap-utils                    glade-perl                 maint-guide
appunti-informatica-libera  iris                       myspell-sv
ayuda                       itcl3                      pdp
cadubi                      kernel-patch-2.4.26-s390   pyca
cantus                      kernel-patch-2.4.27-s390   rat
crafted                     krb4                       strategoxt
doc-linux-html-pt           lg-issue46                 sympa
doc-linux-text-pt           libcgi-validate-perl       syslog-ng
elmo                        libexporter-lite-perl      wmnetmon
fcmp                        libuniversal-exports-perl  wordtrans
fortunes-it                 linux-tutorial-es          wprint


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-05 Thread Petter Reinholdtsen
[Peter Samuelson]
 We seem to be moving to a de facto standard of UTF-8 for non-ASCII
 characters in debian/control files.  This is not specified in Policy
 [1], but for hopefully obvious reasons, consistency is a Good Thing,
 and UTF-8 seems to be the best solution for this sort of thing.

Some will argue that only ASCII is acceptable in debian/control files.
I am not one of these.

I agree that we should standardise on UTF-8 for both the changelog and
the control file (and the copyright file, for the upstream author and
package author names).  We need to be able to correctly represent the
names of people, and it can not be done using ASCII only.

Good to see that most packages already use UTF-8.  I hope the
packages using other charsets can be converted to UTF-8 as soon as
possible.




Re: charsets in debian/control

2004-12-05 Thread Andreas Barth
* Petter Reinholdtsen ([EMAIL PROTECTED]) [041205 11:30]:
 [Peter Samuelson]
  We seem to be moving to a de facto standard of UTF-8 for non-ASCII
  characters in debian/control files.  This is not specified in Policy
  [1], but for hopefully obvious reasons, consistency is a Good Thing,
  and UTF-8 seems to be the best solution for this sort of thing.

 Some will argue that only ASCII is acceptable in debian/control files.
 I am not one of these.
 
 I agree that we should standardise on UTF-8 for both the changelog and
 the control file (and the copyright file, for the upstream author and
 package author names).  We need to be able to correctly represent the
 names of people, and it can not be done using ASCII only.
 
 Good to see that most packages already uses UTF-8.  I hope the
 packages using other charsets can be converted to UTF-8 as soon as
 possible.

There are different ways to view that, and there is a policy bug about
that very topic.

I think most of us agree that non-UTF-8 characters are not a good idea
(please note that UTF-8 is a superset of ASCII).  For some
places (like package names), I think most of us even agree that only
ASCII characters should be used. Also, there is the proposal that in
other fields (i.e. names), a translation should (also) be used if the
characters are not in some basic classes (more or less: ASCII plus
ASCII-similar letters).

So, I personally consider non-UTF-8 characters a bug, and
UTF-8-not-ASCII on the way from bug to allowed.



Cheers,
Andi
-- 
   http://home.arcor.de/andreas-barth/
   PGP 1024/89FB5CE5  DC F1 85 6D A6 45 9C 0F  3B BE F1 D0 C5 D1 D9 0C




Re: charsets in debian/control

2004-12-05 Thread Josselin Mouette
On Sunday 05 December 2004 at 11:43 +0100, Andreas Barth wrote:
 I think most of us agree that non-UTF-8-characters are not a good idea
 (please note the UTF-8-characters is a superset of ASCII).  For some
 places (like package names), I think most of us even agree that only
 ASCII-characters should be used. Also, there is the proposal that in
 other fields (i.e. names), an translation should (also) be used if the
 characters are not in some basic classes (more or less: ASCII plus
 ASCII-similar letters).
 
 So, I personally consider non-UTF-8-characters an bug, and
 UTF-8-not-ASCII on the way from bug to allowed.

Many of us have names that can't be written using ASCII. Furthermore,
the Debian tools need consistency between the developer name in the
changelog and the Maintainer/Uploaders fields in the control file. The
only way for these developers to have a policy-compliant changelog
without having their uploads considered as NMUs is to encode the control
file in UTF-8.
-- 
 .''`.   Josselin Mouette/\./\
: :' :   [EMAIL PROTECTED]
`. `'[EMAIL PROTECTED]
  `-  Debian GNU/Linux -- The power of freedom


signature.asc
Description: This is a digitally signed message part


Re: charsets in debian/control

2004-12-05 Thread Andreas Barth
* Josselin Mouette ([EMAIL PROTECTED]) [041205 13:05]:
 On Sunday 05 December 2004 at 11:43 +0100, Andreas Barth wrote:
  I think most of us agree that non-UTF-8-characters are not a good idea
  (please note the UTF-8-characters is a superset of ASCII).  For some
  places (like package names), I think most of us even agree that only
  ASCII-characters should be used. Also, there is the proposal that in
  other fields (i.e. names), an translation should (also) be used if the
  characters are not in some basic classes (more or less: ASCII plus
  ASCII-similar letters).
  
  So, I personally consider non-UTF-8-characters an bug, and
  UTF-8-not-ASCII on the way from bug to allowed.
 
 Many of us have names that can't be written using ASCII. Furthermore,
 the Debian tools need consistency between the developer name in the
 changelog and the Maintainer/Uploaders fields in the control file. The
 only way for these developers to have a policy-compliant changelog
 without having their uploads considered as NMUs is to encode the control
 file in UTF-8.

Though I agree with your last statement (and please remember, I'm from
Germany, where non-ASCII characters are also in common use), I still
consider that UTF-8-not-ASCII has not yet fully reached "OK", but it's on
the way there (and I consider this a good thing).


Cheers,
Andi
-- 
   http://home.arcor.de/andreas-barth/
   PGP 1024/89FB5CE5  DC F1 85 6D A6 45 9C 0F  3B BE F1 D0 C5 D1 D9 0C




Re: charsets in debian/control

2004-12-05 Thread Steinar H. Gunderson
On Sun, Dec 05, 2004 at 01:01:16PM +0100, Josselin Mouette wrote:
 Many of us have names that can't be written using ASCII.

Well, they usually can be transliterated, can't they?

Transliterating is somewhat of a kludge (and I think in most cases UTF-8 is a
much better solution); OTOH I'd rapidly get confused in the list of Japanese
maintainers if their names weren't transliterated.

/* Steinar */
-- 
Homepage: http://www.sesse.net/




Re: charsets in debian/control

2004-12-05 Thread Marco d'Itri
On Dec 05, Peter Samuelson [EMAIL PROTECTED] wrote:

 Would people support a mass bug at minor severity?
Make it normal.

-- 
ciao, |
Marco | [9589 inOGrPyJFNKhM]


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-05 Thread Marco d'Itri
On Dec 05, Steinar H. Gunderson [EMAIL PROTECTED] wrote:

 Transliterating is somewhat of a kludge (and I think in most cases UTF-8 is a
 much better solution); OTOH I'd rapidly get confused in the list of Japanese
 maintainers if their names weren't transliterated.
This is a different issue: in an international environment, people who
write their name in a non-Latin script should also add a romanized
version.

-- 
ciao, |
Marco | [9590 titPdfXuT6SXM]


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-05 Thread Peter Samuelson

[Steinar H. Gunderson]
 Transliterating is somewhat of a kludge (and I think in most cases
 UTF-8 is a much better solution); OTOH I'd rapidly get confused in
 the list of Japanese maintainers if their names weren't
 transliterated.

I think it's a valid choice for a maintainer who natively speaks a
language that does not use the Roman alphabet, whether to present one's
name in the preferred form, or a Roman transliteration which will be
easier for most developers to identify.  It's an asymmetric situation,
in that people interacting with Debian development *already* have to
know a modicum of English - and thus, non-ASCII variations on the Roman
alphabet should not confound most of us in the way other writing
systems might.

In either case, at least the email address will be a clue, and a point
of contact.

Peter


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-05 Thread Peter Samuelson

[Marco d'Itri]
  Would people support a mass bug at minor severity?
 Make it normal.

Given that Policy recommends debian/changelog to be utf-8, coupled with
the observation (which I had not thought of) that various tools may
require a maintainer's name in debian/control and debian/changelog to
be the same - I'd agree.

I'll wait for more feedback before doing it, though.  One thing I don't
wish for is a public flogging for filing an unjustified mass bug.

Peter


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-05 Thread Denis Barbier
[Peter Samuelson]
 I suggest that the affected source packages[3] be run through the
 command 'iconv -f ORIGINAL_CHARSET -t utf-8' as soon as convenient.

No, as you noticed this list is short and can be processed in a more
elegant manner, e.g. sympa description uses a no-break space where a
normal space would suffice, so telling maintainer to convert to UTF-8
is not a good idea.
I filed several bugreports months ago for packages having non-ASCII
characters in their description, 3 are closed (#245592, #245594, #245596)
and 2 are still open: itcl3 (#242690) and krb4 (#242694).
IMO such bugreports are better than mass bug filing because the 3 closed
bugreports did not switch to UTF-8 but converted to ASCII instead.  This
requires a manual processing because instructions are not identical for
all bugreports.
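
To support that kind of manual review, a maintainer could list the non-ASCII
characters in a description before deciding whether to convert or replace
them. A minimal Python sketch, with a made-up sample string and a
hypothetical helper name:

```python
import unicodedata


def non_ascii_report(text: str):
    """Return (offset, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


# Made-up description containing a no-break space between the two words.
description = "mailing\u00a0list manager"
for offset, code, name in non_ascii_report(description):
    print(offset, code, name)   # 7 U+00A0 NO-BREAK SPACE
```

A no-break space found this way can simply become an ordinary space, which
keeps the description pure ASCII and avoids the charset question entirely.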

Denis




Re: charsets in debian/control

2004-12-05 Thread Goswin von Brederlow
Josselin Mouette [EMAIL PROTECTED] writes:

 On Sunday 05 December 2004 at 11:43 +0100, Andreas Barth wrote:
 I think most of us agree that non-UTF-8-characters are not a good idea
 (please note the UTF-8-characters is a superset of ASCII).  For some
 places (like package names), I think most of us even agree that only
 ASCII-characters should be used. Also, there is the proposal that in
 other fields (i.e. names), an translation should (also) be used if the
 characters are not in some basic classes (more or less: ASCII plus
 ASCII-similar letters).
 
 So, I personally consider non-UTF-8-characters an bug, and
 UTF-8-not-ASCII on the way from bug to allowed.

 Many of us have names that can't be written using ASCII. Furthermore,
 the Debian tools need consistency between the developer name in the
 changelog and the Maintainer/Uploaders fields in the control file. The
 only way for these developers to have a policy-compliant changelog
 without having their uploads considered as NMUs is to encode the control
 file in UTF-8.
 -- 
  .''`.   Josselin Mouette/\./\
 : :' :   [EMAIL PROTECTED]
 `. `'[EMAIL PROTECTED]
   `-  Debian GNU/Linux -- The power of freedom

Which means all control file, changelog file, changes file, Packages
and Sources file parsing programs have to be truly converted to
UTF-8.

dpkg, apt, aptitude, dselect, apt-proxy, apt-cacher(?), debmirror,
debpartial-mirror, DAK, cdebootstrap, ... I guess most just work out
of luck with the mixture we have now.

We already had cdebootstrap crashes because of it (its parser was a
bit stricter than the rest).

On that note, how likely is it to hit a UTF-8 character encoding that
contains a '\n'? Any non UTF-8 aware parser would assume a new line
has started and get parse errors.

MfG
Goswin




Re: charsets in debian/control

2004-12-05 Thread Bart Schuller
On Sun, Dec 05, 2004 at 06:40:52PM +0100, Goswin von Brederlow wrote:
 On that note, how likely is it to hit a UTF-8 character encoding that
 contains a '\n'? Any non UTF-8 aware parser would assume a new line
 has started and get parse errors.

0% likely, guaranteed.

UTF-8 is *designed* to be upwards compatible with plain ASCII. Every
valid ASCII character has the same meaning in UTF-8. Every UTF-8 byte
sequence for a non-ASCII character will not contain *any* ASCII characters.

This is achieved by making sure that everything above plain ASCII has
the high bit set, not just for the first byte, but for all of them.
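
This property is easy to verify by brute force. The Python sketch below
encodes every non-ASCII Unicode scalar value (surrogates excluded, since they
cannot be encoded) and looks for any byte below 0x80; the variable names are
illustrative.

```python
# Collect any non-ASCII code point whose UTF-8 form contains an ASCII byte.
violations = [
    cp
    for cp in range(0x80, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF                  # skip surrogates
    and min(chr(cp).encode("utf-8")) < 0x80        # any byte without high bit?
]

print(len(violations))   # 0

# Consequence for line-based parsers: no newline byte hides in a character.
sample = "\u0142\u20ac\u6f22".encode("utf-8")
print(b"\n" in sample)   # False
```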

-- 
Bart.




Re: charsets in debian/control

2004-12-05 Thread Goswin von Brederlow
Bart Schuller [EMAIL PROTECTED] writes:

 On Sun, Dec 05, 2004 at 06:40:52PM +0100, Goswin von Brederlow wrote:
 On that note, how likely is it to hit a UTF-8 character encoding that
 contains a '\n'? Any non UTF-8 aware parser would assume a new line
 has started and get parse errors.

 0% likely, guaranteed.

 UTF-8 is *designed* to be upwards compatible with plain ASCII. Every
 valid ASCII character has the same meaning in UTF-8. Every UTF-8 byte
 sequence for a non-ASCII character will not contain *any* ASCII characters.

 This is achieved by making sure that everything above plain ASCII has
 the high bit set, not just for the first byte, but for all of them.

Ok, so no problems there. Any parser that accepts 8-bit non-ASCII chars
will accept UTF-8 then. What remains is just making the UTF-8 chars
visually correct then.
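
A deliberately simplified Python sketch of such an 8-bit-clean parser is
shown below. It is not the parser of dpkg, apt, or any other real tool, and
the stanza contents are invented. It splits on the ASCII bytes for newline
and colon only, and never needs to know the charset of the values.

```python
def parse_stanza(raw: bytes) -> dict:
    """Parse one control stanza at the byte level.

    Lines starting with a space or tab continue the previous field.
    """
    fields = {}
    key = None
    for line in raw.split(b"\n"):
        if not line:
            continue
        if line[:1] in (b" ", b"\t") and key is not None:
            fields[key] += b"\n" + line
        else:
            name, _, value = line.partition(b":")
            key = name
            fields[key] = value.strip()
    return fields


# Invented stanza; the maintainer name contains non-ASCII characters.
stanza = (
    "Package: hello\n"
    "Maintainer: Ren\u00e9 \u0141uk <rene@example.org>\n"
    "Description: example\n"
    " long text\n"
).encode("utf-8")

fields = parse_stanza(stanza)
# Decoding is a separate, later step, needed only for display.
print(fields[b"Maintainer"].decode("utf-8"))
```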

MfG
Goswin




Re: charsets in debian/control

2004-12-05 Thread Bernd Eckenfels
On Sun, Dec 05, 2004 at 06:40:52PM +0100, Goswin von Brederlow wrote:
 On that note, how likely is it to hit a UTF-8 character encoding that
 contains a '\n'? Any non UTF-8 aware parser would assume a new line
 has started and get parse errors.

That's no problem. The only problem you have with UTF-8 is that a UTF-8
reader will see illegal byte sequences in a traditionally encoded (latin1)
file.
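
A short Python sketch of that failure mode, assuming a strict UTF-8 decoder;
the helper name and the sample bytes are made up for illustration.

```python
def first_invalid_offset(raw: bytes):
    """Return the offset of the first illegal UTF-8 sequence, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


latin1_line = "Jos\u00e9 Garc\u00eda".encode("iso-8859-1")
utf8_line = "Jos\u00e9 Garc\u00eda".encode("utf-8")

print(first_invalid_offset(latin1_line))   # 3
print(first_invalid_offset(utf8_line))     # None
```

The byte 0xE9 announces a three-byte sequence, but the space that follows is
not a continuation byte, so the decoder rejects the Latin-1 line at offset 3.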

Greetings
Bernd
-- 
  (OO)  -- [EMAIL PROTECTED] --
 ( .. )  [EMAIL PROTECTED],linux.de,debian.org}  http://www.eckes.org/
  o--o 1024D/E383CD7E  [EMAIL PROTECTED]  v:+497211603874  f:+497211606754
(OO)  When cryptography is outlawed, bayl bhgynjf jvyy unir cevinpl!




Re: charsets in debian/control

2004-12-05 Thread Thaddeus H. Black
Peter Samuelson writes,

 We seem to be moving to a de facto standard of UTF-8 for non-ASCII
 characters in debian/control files.  This is not specified in Policy
 [1], but for hopefully obvious reasons, consistency is a Good Thing,
 and UTF-8 seems to be the best solution for this sort of thing.

Would Peter permit me a mild dissent?  I prefer Latin-1.  Reason: I can
recognize and distinguish Latin-1 characters, even when I do not always
understand the words they spell.  Recognizing and distinguishing the
characters is important to me.  And not just to me.  Imagine the dismay
of a Korean user trying to read Arabic script in a control file.

Well, the Korean user can speak for himself.  Speaking for myself, ASCII
is a little too limited.  There is a proper balance to strike, and to me
Latin-1 though imperfect is about right.

Latin-1 is wrong if you speak Polish, of course, and even if you don't
speak Polish, Latin-1's lack of a euro sign is slightly annoying; but,
well, I admit that I do not really mind precisely where the line is
drawn, so long as the general simple Latin concept of writing is
preserved and the number of distinct characters represented is kept
within reasonable bounds.  Regarding only Latin, Unicode recognizes over
eight hundred Latin characters: far too many for me.  This is not
considering Cyrillic or Greek; nor even beginning to think of the
numerous very different writing systems of a wider non-Western
world---worthy writing systems which I cannot even transcribe much less
read---beautiful writing systems in which the basic Western
left-to-right, character-based, diacritically marked semantics are not
preserved.  For the Debian Project, madness lies that way.  If Latin-1
is established and used if not universally loved, then probably we
should limit our usage to it.

I do not deny that Latin-1 represents all the languages I can read, and
that this fact may color my view.  Nevertheless to me a source written
in Chinese is effectively non-free.  It might as well be a compiled
binary blob.

Actually, UTF-8 encoding as such is fine.  It uses a few extra 0xC2 and
0xC3 bytes for the Latin-1 characters (see utf-8(7)), but this does not
matter much.  The full UTF-8 domain has numerous subtle semantics which
I should like to be able to avoid, however.  UTF-8 is for Unicode, which
is to allow the representation of the languages of the world in their
own scripts.  While highly useful in its own domain, this has little to
do with Debian control files, where we probably do not want the
languages of the world represented in any event.
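
For reference, the cost of carrying Latin-1 text in UTF-8 can be measured
directly. The Python sketch below encodes every non-ASCII Latin-1 code point
and records the sequence lengths and lead bytes; the names are illustrative.

```python
# Encode every non-ASCII Latin-1 code point (U+0080..U+00FF) as UTF-8.
encoded = [chr(cp).encode("utf-8") for cp in range(0x80, 0x100)]

lengths = {len(e) for e in encoded}       # bytes per character
lead_bytes = {e[0] for e in encoded}      # first byte of each sequence

print(sorted(lengths))                        # [2]
print([hex(b) for b in sorted(lead_bytes)])   # ['0xc2', '0xc3']


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xC0 and 0xC1 never appear in well-formed UTF-8.
print(is_valid_utf8(b"\xc0\x80"), is_valid_utf8(b"\xc1\xbf"))   # False False
```

Each such character costs exactly one extra byte, with 0xC2 or 0xC3 as the
lead byte.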

I would tend to recommend that untranslated Debian work, especially
control files, be limited to Latin-1.  If the Japanese maintainers
uncomplainingly transliterate their names to Latin-1 for our benefit,
then probably the rest of us should do likewise.  Whether the Latin-1 is
C2/C3-encoded as UTF-8, however, is a matter of indifference to me.

-- 
Thaddeus H. Black
508 Nellie's Cave Road
Blacksburg, Virginia 24060, USA
+1 540 961 0920, [EMAIL PROTECTED]


pgpZfqqlenkJK.pgp
Description: PGP signature


Re: charsets in debian/control

2004-12-05 Thread Jose Carlos Garcia Sogo
On Sun, 05-12-2004 at 20:16 +, Thaddeus H. Black wrote:
 Peter Samuelson writes,
 
  We seem to be moving to a de facto standard of UTF-8 for non-ASCII
  characters in debian/control files.  This is not specified in Policy
  [1], but for hopefully obvious reasons, consistency is a Good Thing,
  and UTF-8 seems to be the best solution for this sort of thing.
 
 Would Peter permit me a mild dissent?  I prefer Latin-1.  Reason: I can
 recognize and distinguish Latin-1 characters, even when I do not always
 understand the words they spell.  Recognizing and distinguishing the
 characters is important to me.  And not just to me.  Imagine the dismay
 of a Korean user trying to read Arabic script in a control file.

 But the only field in UTF-8 should be Maintainer, and that field should
also have (IMHO) a roman transliteration of the name, if you don't use a
Latin charset (Greek, Arabic, Japanese, Chinese...).

 So I don't really see your point here. That you can write your name in
your native alphabet doesn't mean that from now on people can write their
rules files in Chinese, or whatever.

 Cheers,

-- 
Jose Carlos Garcia Sogo
   [EMAIL PROTECTED]


signature.asc
Description: This part of the message is digitally signed


Re: charsets in debian/control

2004-12-05 Thread Daniel Burrows
On Sunday 05 December 2004 03:32 pm, Jose Carlos Garcia Sogo wrote:
  Would Peter permit me a mild dissent?  I prefer Latin-1.  Reason: I can
  recognize and distinguish Latin-1 characters, even when I do not always
  understand the words they spell.  Recognizing and distinguishing the
  characters is important to me.  And not just to me.  Imagine the dismay
  of a Korean user trying to read Arabic script in a control file.

  But the only field in UTF-8 should be Maintainer, and that field should
 also have (IMHO) a roman transliteration of the name, if you don't use a
 Latin charset (Greek, Arabic, Japanese, Chinese...).

  Well, when aptitude gets UTF8 support, it'll decode all the control fields 
that are mainly meant for human consumption: that means at least Description 
in addition to the Maintainer field, and maybe also Section.

  I don't see any reason to limit ourselves in the long term by sticking to 
Latin1 (or ASCII) just because none of us can read all of the languages that 
are available in the extended UTF8 namespace.  If we want people to stick to 
certain subsets of UTF8, that should be determined in Policy, not the 
packaging software.

  If you want a practical concern (aside from, say, a general suspicion of 
building policy into software tools), consider these cases:

  - Someone wants to translate the Description fields of all packages in 
Debian into Chinese or Arabic.  What will they do if the package tools only 
support Latin-1?

  - Someone wants to use the Debian packaging tools to create a new 
distribution for use in China.  Again, what will they do if the package tools 
only support Latin-1?

  Daniel

-- 
/--- Daniel Burrows [EMAIL PROTECTED] --\
|We've got nothing to fear but the stuff that we're|
| afraid of! -- Fluble |
\--- Be like the kid in the movie!  Play chess! -- http://www.uschess.org --/


pgpPb1jhTqyTk.pgp
Description: PGP signature


Re: charsets in debian/control

2004-12-05 Thread Paul Hampson
On Sun, Dec 05, 2004 at 04:42:24PM -0500, Daniel Burrows wrote:
 On Sunday 05 December 2004 03:32 pm, Jose Carlos Garcia Sogo wrote:
   Would Peter permit me a mild dissent?  I prefer Latin-1.  Reason: I can
   recognize and distinguish Latin-1 characters, even when I do not always
   understand the words they spell.  Recognizing and distinguishing the
   characters is important to me.  And not just to me.  Imagine the dismay
   of a Korean user trying to read Arabic script in a control file.

   But the only field in UTF-8 should be Maintainer, and that field should
  also have (IMHO) a roman transliteration of the name, if you don't use a
  Latin charset (Greek, Arabic, Japanese, Chinese...).

   Well, when aptitude gets UTF8 support, it'll decode all the control fields 
 that are mainly meant for human consumption: that means at least Description 
 in addition to the Maintainer field, and maybe also Section.

   I don't see any reason to limit ourselves in the long term by sticking to 
 Latin1 (or ASCII) just because none of us can read all of the languages that 
 are available in the extended UTF8 namespace.  If we want people to stick to 
 certain subsets of UTF8, that should be determined in Policy, not the 
 packaging software.

   If you want a practical concern (aside from, say, a general suspicion of 
 building policy into software tools), consider these cases:

   - Someone wants to translate the Description fields of all packages in 
 Debian into Chinese or Arabic.  What will they do if the package tools only 
 support Latin-1?

   - Someone wants to use the Debian packaging tools to create a new 
 distribution for use in China.  Again, what will they do if the package tools 
 only support Latin-1?

Isn't there a proposal around for
Description#en: English text
Description#ja: Japanese text
etc?

I can see that this would have to be split somehow to avoid the
Packages file suddenly filling CD1 on its own, but...

-- 
Paul TBBle Hampson, [EMAIL PROTECTED]
7th year CompSci/Asian Studies student, ANU

Shorter .sig for a more eco-friendly paperless office.


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-05 Thread Mike Hommey
On Mon, Dec 06, 2004 at 09:54:36AM +1100, Paul Hampson [EMAIL PROTECTED] 
wrote:
 Isn't there a proposal around for
 Description#en: English text
 Description#ja: Japanese text

And you'd advocate writing the English text in Latin-1 and the Japanese
text in EUC-JP?
Let's make it clear: 1 text file, 1 encoding.
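
A small Python sketch of why mixing encodings in one file is a problem (an
illustration only: the "Description#..." lines follow the proposal quoted
above, while the sample text and the helper name try_decode are invented
here).  It concatenates a Latin-1 line and an EUC-JP line, then tries to
read the result back with a single codec.

```python
# Invented sample content; only the field-name syntax comes from the thread.
english = "Description#en: caf\u00e9 menu\n".encode("latin-1")
japanese = "Description#ja: \u65e5\u672c\u8a9e\n".encode("euc_jp")
mixed = english + japanese  # one file, two encodings


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for the codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


for enc in ("utf-8", "euc_jp", "latin-1"):
    # No single codec recovers both lines as they were written.
    print(enc, "->", repr(try_decode(mixed, enc)))
```

Latin-1 accepts any byte sequence, so it never reports an error; it just
silently turns the Japanese line into garbage.  Writing both lines in
UTF-8 avoids the problem, since one decoder then handles the whole file.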

Mike




Re: charsets in debian/control

2004-12-05 Thread Josselin Mouette
On Monday 06 December 2004 at 09:26 +0900, Mike Hommey wrote:
 On Mon, Dec 06, 2004 at 09:54:36AM +1100, Paul Hampson [EMAIL PROTECTED] 
 wrote:
  Isn't there a proposal around for
  Description#en: English text
  Description#ja: Japanese text
 
 And you'd advocate writing the English text in Latin-1 and the Japanese
 text in EUC-JP?
 Let's make it clear: 1 text file, 1 encoding.

Well, it's already the case for the generated localized debconf template
files. Not that I believe it is a good thing...
-- 
 .''`.   Josselin Mouette/\./\
: :' :   [EMAIL PROTECTED]
`. `'[EMAIL PROTECTED]
  `-  Debian GNU/Linux -- The power of freedom


signature.asc
Description: This is a digitally signed message part


Re: charsets in debian/control

2004-12-05 Thread Peter Samuelson

[Thaddeus H. Black]
 Would Peter permit me a mild dissent?  I prefer Latin-1.

Dissents are fine. (:

The reason to go with UTF-8 is for consistency.  Tools that wish to
render text onto the screen ought to be able to depend on knowing the
encoding that text is in.  See below for why I (and many others) think
UTF-8 is the right choice for an encoding to standardize on.

 I do not deny that Latin-1 represents all the languages I can read,
 and that this fact may color my view.  Nevertheless to me a source
 written in Chinese is effectively non-free.  It might as well be a
 compiled binary blob.

Consider packages intended for speakers of other languages: for
example, an Urdu dictionary.  The Description field would traditionally
describe the package both in English and in Urdu (which uses the Arabic
alphabet), and I think that's perfectly fine: the target audience can
read its description more easily, and the rest of us can read the
English.  Now extrapolate to cases involving arbitrary languages, and
this is possible only if the Description field uses an encoding of
Unicode.  (Well, one could invent an extra header to specify the
character set, but that seems pointless in the extreme.)

UTF-8 is by far the best encoding of Unicode for our purposes, since it
was designed to be compatible with tools that parse ASCII.  Other
Unicode encodings have null bytes and other ASCII values embedded in
non-ASCII characters.
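
The ASCII-compatibility point can be checked with a short Python sketch
(illustrative only; the sample string and the helper name
ascii_bytes_inside are made up for this example).  It counts how many
bytes below 0x80 appear inside the encoded form of non-ASCII characters,
which is what confuses tools that scan for ASCII delimiters such as ':'
or newline.

```python
def ascii_bytes_inside(text, encoding):
    """Count encoded bytes below 0x80 that belong to non-ASCII characters."""
    count = 0
    for ch in text:
        if ord(ch) < 0x80:
            continue  # a genuine ASCII character, not of interest here
        count += sum(1 for b in ch.encode(encoding) if b < 0x80)
    return count


# Arbitrary sample: a Polish word and two CJK characters.
sample = "\u0141\u00f3d\u017a \u65e5\u672c"

for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, ascii_bytes_inside(sample, enc))
```

In UTF-8 every byte of a multi-byte character has the high bit set, so the
count is zero.  In UTF-16, by contrast, U+0141 is stored as the bytes 0x41
0x01, and 0x41 is the ASCII code for "A".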

You can argue, and I would agree, that the Maintainer and Uploaders
fields (the only fields other than Description where we are likely to
see non-ASCII text) ought to be written in roman letters.  People
involved with Debian development are required to know a certain amount
of English in any case, so the roman alphabet is a common denominator.
And, unlike the Description field, it's awkward to try and have both
native glyphs and a roman transliteration.  However, I see no reason to
tell Eastern Europeans that they cannot write their names natively;
interpreting Eastern European diacritics is no harder for people who
don't speak those languages than interpreting Western European
diacritics for people who don't speak those.

Peter


signature.asc
Description: Digital signature


RE: charsets in debian/control

2004-12-05 Thread Julian Mehnle
Thaddeus H. Black wrote:
 I do not deny that Latin-1 represents all the languages I can read, and
 that this fact may color my view.  Nevertheless to me a source written
 in Chinese is effectively non-free.  It might as well be a compiled
 binary blob. 

So Emacs is effectively non-free, because I don't speak Lisp.

Heh, good one! ;-)




Re: charsets in debian/control

2004-12-05 Thread Andrew Suffield
On Sun, Dec 05, 2004 at 09:32:00PM +0100, Jose Carlos Garcia Sogo wrote:
  But the only field in UTF-8 should be Maintainer, and that field should
 also have (IMHO) a roman transliteration of the name, if you don't use a
 Latin charset (Greek, Arabic, Japanese, Chinese...).

The transliterated field should be called 'Maintainer'. If you want
some other freaky encoding, unsupported by the older tools, put it in
a new field. Using the old field just breaks stuff for no reason.

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ |
 `. `'  |
   `- --  |


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-05 Thread Paul Hampson
On Mon, Dec 06, 2004 at 09:26:57AM +0900, Mike Hommey wrote:
 On Mon, Dec 06, 2004 at 09:54:36AM +1100, Paul Hampson [EMAIL PROTECTED] 
 wrote:
  Isn't there a proposal around for
  Description#en: English text
  Description#ja: Japanese text

 And you'd advocate writing the English text in Latin-1 and the Japanese
 text in EUC-JP?
 Let's make it clear: 1 text file, 1 encoding.

No, I'm advocating the whole thing be UTF-8, to support the
above sensibly. Sorry I wasn't clearer. -_-;

(On the other hand, I'd _love_ to see rules files written
in other languages, as long as the programs being called have
manpages in English. ^_^)

-- 
Paul TBBle Hampson, [EMAIL PROTECTED]
7th year CompSci/Asian Studies student, ANU

Shorter .sig for a more eco-friendly paperless office.


signature.asc
Description: Digital signature


Re: charsets in debian/control

2004-12-05 Thread Paul Hampson
On Mon, Dec 06, 2004 at 01:40:27AM +, Andrew Suffield wrote:
 On Sun, Dec 05, 2004 at 09:32:00PM +0100, Jose Carlos Garcia Sogo wrote:
   But the only field in UTF-8 should be Maintainer, and that field should
  also have (IMHO) a roman transliteration of the name, if you don't use a
  Latin charset (Greek, Arabic, Japanese, Chinese...).

 The transliterated field should be called 'Maintainer'. If you want
 some other freaky encoding, unsupported by the older tools, put it in
 a new field. Using the old field just breaks stuff for no reason.

I like this idea...

As I recall, people can throw new header fields in willy-nilly,
so people can as of _now_ put...
Maintainer-Name: utf-8 encoded maintainer name
for example. The email address isn't important, since that has to
be a subset of ASCII anyway.
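
A rough Python sketch of how a tool might use such an extra field
(hypothetical throughout: the Maintainer-Name field is only the suggestion
made above, the stanza contents are invented, and parse_fields is a toy
parser that ignores continuation lines rather than any real Debian tool).

```python
# Hypothetical stanza: the Maintainer-Name field and all values are invented.
raw = (
    "Package: example\n"
    "Maintainer: Michal Example <michal@example.org>\n"
    "Maintainer-Name: Micha\u0142 Example\n"
).encode("utf-8")


def parse_fields(data):
    """Parse simple single-line 'Name: value' fields from UTF-8 bytes."""
    fields = {}
    for line in data.decode("utf-8").splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[name.strip()] = value.strip()
    return fields


fields = parse_fields(raw)
# Prefer the native spelling when present, else fall back to Maintainer.
display = fields.get("Maintainer-Name", fields["Maintainer"])
print(display)
```

Older tools that only look at Maintainer keep seeing pure ASCII, while a
newer tool can choose to display the native spelling.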

Apart from documenting correct orthography though, I'm not sure
if this has any importance in terms of the tools... Changelogs
would presumably continue to match the Maintainer: field, and
so they would include the ASCII transliteration of the name.

-- 
Paul TBBle Hampson, [EMAIL PROTECTED]
7th year CompSci/Asian Studies student, ANU

Shorter .sig for a more eco-friendly paperless office.


signature.asc
Description: Digital signature