date:20031021

Re: Line Separator and Paragraph Separator

2003-10-21 Thread Doug Ewell

Marco Cimarosti  wrote:

> Jill Ramonsky wrote:
>> [...] I've even invented (and used) some 8-bit encodings which
>> leave the whole of Latin-1 unchanged (apart from the C1s) and use C1
>> characters a bit like "surrogate pairs" to reach the rest.
>
> Doug, are you listening? It seems there's a new clone of UTF:-)Z
> waiting for implementation!

Inventing new UTF's, or XTF's, or UTF:-)s is a little like punning, in
that there are at least three types of people:

a.  those who do it
b.  those who criticize those who do it
c.  those who belong to both (a) and (b)

Jill, I'd be interested in details of your invented encodings, just for
fun.  Please e-mail privately to avoid incurring the wrath of group (b).

Marco, remind me to tell you sometime about "Dynamic Code Pages."

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

Cuneiform for Unicode - Request for Comments

2003-10-21 Thread Dean Snyder

The Initiative for Cuneiform Encoding has just posted preliminary data
associated with its proposal for the encoding of Sumero-Akkadian
cuneiform in Unicode. Details can be found at .

The sign lists published there represent an historic merging of two major
unpublished cuneiform sign lists:
 
1) The Pennsylvania Sign List, representing in turn the unification of
several historically significant cuneiform sign lists, including those of
Dr. Miguel Civil, Oriental Institute, University of Chicago, Dr. Robert
Englund, UCLA, and his colleagues at the Cuneiform Digital Library
Initiative, and Dr. Steve Tinney, University of Pennsylvania,
Pennsylvania Sumerian Dictionary.

2) And the forthcoming Mesopotamisches Zeichenlexikon by Dr. Rykle
Borger, Georg-August-Universität, Goettingen, the third edition of his
sign list in preparation for over 10 years. MZL is expected to be out in
Winter 2003-2004.

In addition, we are consulting the standard printed sign lists of Fossey,
Deimel, Labat, von Soden, Ruester & Neu, Steve, et al.

Those interested in making comments, suggestions, and corrections are
encouraged to join, if you haven't already, the cuneiform email list
hosted by the Unicode Consortium. Just send an email to
"[EMAIL PROTECTED]" with "subscribe cuneiform" in the subject line.

Our initial encoding proposal will be presented to the Unicode Technical
Committee November 3 & 4, 2003, during its international conference being
hosted at Johns Hopkins University by the Digital Hammurabi Project. Over
the next several months we plan to finesse this initial proposal before
making our final proposal to Unicode in the Spring of 2004.

We are attempting to encode all graphemically contrastive, complex, but
not compound, Sumero-Akkadian cuneiform signs beginning with the URIII
period, including signs used for writing Sumerian, Akkadian, Hittite,
Hurrian, and Elamite. For various reasons the archaic period scripts will
be added to Unicode later.
 
Along with many other contributors, particular thanks are due to
the members of the ICE2 working group:

Dr. Miguel Civil, Oriental Institute, Univ. Chicago
Dr. Jerrold Cooper, Johns Hopkins University
Dr. Karljürgen Feuerherm, Wilfrid Laurier Univ.
Dr. Madeleine Fitzgerald, UCLA
Dr. Eckart Frahm, Yale Univ.
Cale Johnson, UCLA
Dr. Matthew Stolper, Oriental Institute, Univ. Chicago
Dr. Steve Tinney, Univ. Pennsylvania
Dr. Kenneth Whistler, Unicode Consortium


Respectfully,

Dean A. Snyder
Scholarly Technology Specialist
Library Digital Programs, Sheridan Libraries
Garrett Room, MSE Library, 3400 N. Charles St.
Johns Hopkins University
Baltimore, Maryland, USA 21218

office: 410 516-6850 mobile: 410 245-7168 fax: 410-516-6229
Manager, Digital Hammurabi Project: www.jhu.edu/digitalhammurabi

Re: unicode on Linux

2003-10-21 Thread Benjamin Peterson

Edward H. Trager wrote:
> and to my knowledge Windows does not yet have grep at all ...


Oh, a curse on Bill Gates and his newfangled Micro$loth systems :)  To
_my_ knowledge, however...

There's cygwin.

Or better yet, no cygwin!  http://unxutils.sourceforge.net/

Or, GNU grep!  http://gnuwin32.sourceforge.net/
There's a different build of GNU grep here: 
http://members.ozemail.com.au/~crn/grep.html

Or, you could use the MS equivalent, findstr, which works on multibyte
characters provided it can guess the encoding from the current codepage
(i.e. you have to set code page to 932 to make it work on a shift-JIS
file, and so on).  You'd think you could use it on utf-8 by setting
codepage to 65001 but it doesn't happen for me.  On the other hand it
does recurse into directories.

Or, there's the DJGPP version of grep:  http://www.delorie.com/djgpp/

And related to it, there's the version that uses the PW32 project: 
http://pw32.sourceforge.net/

Or, there's cgrep and jgrep; but I don't know what particular encodings
they work with and I don't have the URLs to hand.

Or, there's a modified GNU grep here: 
http://www.interlog.com/~tcharron/grep.html


...and so on.  I usually use unxutils.

-- 
  Benjamin Peterson
  [EMAIL PROTECTED]

Re: GDP by language

2003-10-21 Thread Mark Davis

I don't think it is quite that simple. Look at India, for example.

Mark
__
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

- Original Message - 
From: "John Cowan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tue, 2003 Oct 21 12:36
Subject: Re: GDP by language


> Mark Davis scripsit:
>
> > Thus they are rough figures, since different language groups will have
> > unequal distributions of GDP; and there may be significant multilingual
> > populations.
>
> In fact, officially multilingual countries are less likely to have polyglot
> citizens than officially monolingual ones.  The whole point of being
> officially multilingual, after all, is to allow multiple groups of monoglots
> to get equal access to government services.  If most of your citizens are
> polyglots, you may as well choose the language that most of them can speak,
> even as L2 or L3, as the official language.
>
> -- 
> John Cowan  http://www.ccil.org/~cowan  [EMAIL PROTECTED]
> To say that Bilbo's breath was taken away is no description at all.  There are
> no words left to express his staggerment, since Men changed the language that
> they learned of elves in the days when all the world was wonderful. --The
> Hobbit
>

Re: Klingons and their allies - Beyond 17 planes

2003-10-21 Thread Mark Davis

>I can even read Mark
> Davis' signature - that is, it appears correctly, I'd love to know what
> it means!


शिष्यादिच्छेत्पराजयम्

shiSyAdicchetparAjayam

shiSyAt  ‘from the student’

icchet  ‘one should desire’

parAjayam  ‘defeat’

‘A teacher should wish to be defeated by his own student in scholarship’


I got this from Peri Bhaskararao on our recent trip to India. I had said
something reminiscent of this in a talk, and he let me know of this saying -- 
which I liked -- then sent the details to me later.

Mark
__
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

- Original Message - 
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Sent: Tue, 2003 Oct 21 12:14
Subject: Re: Klingons and their allies - Beyond 17 planes


> On 18/10/2003 10:53, Peter Kirk wrote:
>
> > On 18/10/2003 08:42, Doug Ewell wrote:
> >
> >> Tom Gewecke  wrote:
> >>
> >>
> >>
> >>> The problems mentioned earlier in this thread disappear if one uses
> >>> correct html/css for websites and uses html mail rather than plain
> >>> text with the Mozilla mail client, which otherwise won't let you
> >>> choose the font for incoming mail.
> >>>
> >>
> >>
> >> For all the criticisms that people love to fling toward Outlook Express,
> >> it does let me choose the font for incoming mail.  The e-mail messages I
> >> received in Ewellic were plain text.
> >>
> >> -Doug Ewell
> >> Fullerton, California
> >> http://users.adelphia.net/~dewell/
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> > There definitely seems to be a bug in Mozilla in this area. See for
> > example http://bugzilla.mozilla.org/show_bug.cgi?id=201695 (still
> > unconfirmed - it seems that Mozilla bug reports in this area don't
> > even get looked at in six months) and
> > http://bugzilla.mozilla.org/show_bug.cgi?id=26182 (where they claim
> > the problem was fixed three years ago, but it has reappeared). Maybe
> > there is a fix by setting some user preferences in a special file, but
> > there doesn't seem to be a fix in the UI.
> >
> > I'll file a new bug - done, see
> > http://bugzilla.mozilla.org/show_bug.cgi?id=222777. It will be
> > interesting to see what response I get, if any.
> >
> >
> It turns out that my bug is a duplicate of
> http://bugzilla.mozilla.org/show_bug.cgi?id=91190 and work is in
> progress on fixing it. Meanwhile there is a workaround. UTF-8 plain text
> messages, and web pages, are displayed with the default font for the
> system locale. So change the fonts for your system locale, in my case
> "western", and your plain text UTF-8 messages will show up in that font
> - even with UTF-8 codes outside your system code page. With this
> approach (see comments 34 and 36 to bug 91190) I can even read Mark
> Davis' signature - that is, it appears correctly, I'd love to know what
> it means!
>
> -- 
> Peter Kirk
> [EMAIL PROTECTED] (personal)
> [EMAIL PROTECTED] (work)
> http://www.qaya.org/
>
>
>
>

Re: GDP by language

2003-10-21 Thread John Cowan

Mark Davis scripsit:

> Thus they are rough figures, since different language groups will have unequal
> distributions of GDP; and there may be significant multilingual populations.

In fact, officially multilingual countries are less likely to have polyglot
citizens than officially monolingual ones.  The whole point of being officially
multilingual, after all, is to allow multiple groups of monoglots to get
equal access to government services.  If most of your citizens are polyglots,
you may as well choose the language that most of them can speak, even as L2 or L3,
as the official language.

-- 
John Cowan  http://www.ccil.org/~cowan  [EMAIL PROTECTED]
To say that Bilbo's breath was taken away is no description at all.  There are
no words left to express his staggerment, since Men changed the language that
they learned of elves in the days when all the world was wonderful. --The Hobbit

Re: Klingons and their allies - Beyond 17 planes

2003-10-21 Thread Peter Kirk

On 18/10/2003 10:53, Peter Kirk wrote:

> On 18/10/2003 08:42, Doug Ewell wrote:
>
>> Tom Gewecke  wrote:
>>
>>> The problems mentioned earlier in this thread disappear if one uses
>>> correct html/css for websites and uses html mail rather than plain
>>> text with the Mozilla mail client, which otherwise won't let you
>>> choose the font for incoming mail.
>>
>> For all the criticisms that people love to fling toward Outlook Express,
>> it does let me choose the font for incoming mail.  The e-mail messages I
>> received in Ewellic were plain text.
>>
>> -Doug Ewell
>> Fullerton, California
>> http://users.adelphia.net/~dewell/
>
> There definitely seems to be a bug in Mozilla in this area. See for
> example http://bugzilla.mozilla.org/show_bug.cgi?id=201695 (still
> unconfirmed - it seems that Mozilla bug reports in this area don't
> even get looked at in six months) and
> http://bugzilla.mozilla.org/show_bug.cgi?id=26182 (where they claim
> the problem was fixed three years ago, but it has reappeared). Maybe
> there is a fix by setting some user preferences in a special file, but
> there doesn't seem to be a fix in the UI.
>
> I'll file a new bug - done, see
> http://bugzilla.mozilla.org/show_bug.cgi?id=222777. It will be
> interesting to see what response I get, if any.

It turns out that my bug is a duplicate of 
http://bugzilla.mozilla.org/show_bug.cgi?id=91190 and work is in 
progress on fixing it. Meanwhile there is a workaround. UTF-8 plain text 
messages, and web pages, are displayed with the default font for the 
system locale. So change the fonts for your system locale, in my case 
"western", and your plain text UTF-8 messages will show up in that font 
- even with UTF-8 codes outside your system code page. With this 
approach (see comments 34 and 36 to bug 91190) I can even read Mark 
Davis' signature - that is, it appears correctly, I'd love to know what 
it means!

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Line Separator and Paragraph Separator

2003-10-21 Thread Peter Kirk

On 21/10/2003 08:12, Jill Ramonsky wrote:

> Hmm.
>
> Well, I can't say I've ever found a use for putting either a C0 or a
> C1 control into a text file, beyond the usual CR, LF and TAB. My code
> also often considers FF to be whitespace, although I've never actually
> (knowingly) encountered it in a real text file.

I just encountered a C0 control in one of John Cowan's always 
entertaining signatures, in a plain text e-mail. Well, the source was 
actually "Fran=1B)B=E7ois Yergeaus" but the "=1B" is a quoted-printable 
encoded form of U+001B. I'm not sure what the display intention was as 
Mozilla made no attempt to render it properly.
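
The decoding step described above can be reproduced with a minimal
Python sketch. It assumes only the standard quopri module and the
source string exactly as quoted in this message; it says nothing about
what the sender intended the escape sequence to display as.

```python
import quopri

# The raw quoted-printable text as quoted in the message above.
raw = b"Fran=1B)B=E7ois Yergeaus"

# "=1B" and "=E7" are hex escapes; everything else passes through unchanged.
decoded = quopri.decodestring(raw)

print(decoded)           # the byte string after quoted-printable decoding
print(hex(decoded[4]))   # the fifth byte is the C0 control ESC (U+001B)
```

The byte after "Fran" is the ESC control itself, so any client that
shows the text has to decide what to do with a raw C0 control.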

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-21 Thread Philippe Verdy

From: Jill Ramonsky 

> I would be more than grateful if someone could point me
> in the direction of a DEFINITIVE specification which claims
> this is not the case, that the interpretation of "\n" as
> anything other than LF may be considered conformant
> behaviour.

If you have programmed for MacOS, you may know that
C compilers for that platform generate U+000A=LF for '\r'
and U+000D=CR for '\n'. This conforms to the common
use of CR as the standard line separator in text files on
MacOS.

With MacOS X, I think this has changed. CR was also the
line-end encoding used by the very limited SimpleText program,
which restricted file sizes and did not support Unicode, but
only the MacOS native system character set and encoding.

This changed quite a while ago: C compilers simply
don't care whether source lines are terminated by CR, LF, or a
combination of them. Also, consoles are now common and
tools for MacOS are ported from other environments.

So this legacy encoding of end-of-lines is now quite obsolete
even on MacOS.

However, on IBM MVS systems, which are EBCDIC-based,
end-of-lines are encoded by an NL character, not LF and not
even CR. On them, the C/C++ language '\n' is mapped to NL,
which is the normal character used in console applications to
display end of lines.
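
To make the NL/LF/CR distinction concrete, here is a small Python
sketch. It assumes EBCDIC code page 037 as shipped with Python's
standard codecs; that code page is only a stand-in chosen for
illustration, and a given MVS system may use a different one.

```python
# Assumption: cp037 stands in for "an EBCDIC code page"; real systems vary.
EBCDIC = "cp037"

nl_byte = "\u0085".encode(EBCDIC)  # NEL (next line), the EBCDIC NL control
lf_byte = "\n".encode(EBCDIC)      # LINE FEED, U+000A
cr_byte = "\r".encode(EBCDIC)      # CARRIAGE RETURN, U+000D

# Three different controls, three different byte values.
print(nl_byte.hex(), lf_byte.hex(), cr_byte.hex())
```

In this code page the three controls occupy three distinct byte
values, which is why a C '\n' that means NL is not the same thing as
U+000A.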

In Java however, the mapping of '\n' and '\r' constants is NOT
bound to the underlying system, but permanently assigned to
LF and CR respectively. It's up to the console emulation layer
to adapt and display end-of-lines on the console from an input
'\n' (LF) at run-time. It's up to the File class to transcode the
'\n' constant to a physical end-of-line in actual text files.

In fact this also occurs in C/C++ on CP/M, DOS and Windows systems,
where an internal LF gets converted to a CR+LF sequence by the
FILE* interface of <stdio.h> for files opened in text mode.
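
The same text-mode translation can be observed on any platform with a
short Python sketch. Python is used here only for illustration (the
paragraph above is about C's FILE* interface); the newline="\r\n"
argument forces the CR+LF convention regardless of the host system,
and the temporary file path is arbitrary.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text mode: each internal "\n" is translated to CR+LF on the way out.
with open(path, "w", newline="\r\n") as f:
    f.write("one\ntwo\n")

# Binary mode: no translation, so the physical line ends are visible.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```

The program only ever wrote LF, yet the bytes on disk contain CR+LF
pairs.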

So when discussing characters here, don't use '\n' or '\r'
when you in fact mean LF=U+000A or CR=U+000D, unless you're
using a language like Java that maps these programming
constants to actual run-time characters.

Re: unicode on Linux

2003-10-21 Thread Peter Kirk

On 21/10/2003 09:56, Peter Kirk wrote:

> On 21/10/2003 05:43, Stephane Bortzmeyer wrote:
>
>> ...
>>
>> See:
>>
>> http://www.cl.cam.ac.uk/~mgk25/unicode.html
>
> In this page, Markus Kuhn is damaging his credibility by continuing to
> refer in several places to Unicode 3.0, although the page was updated
> some time after the release of Unicode 4.0. ...

This has been fixed already. Well done, Markus.

> ... Is the rest of this material similarly out of date?

I presume not. If anything is, then please tell Markus.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: GDP by language

2003-10-21 Thread Mark Davis

I found it very interesting too.

For countries with multiple languages it is only approximate, since there is
little data available. What I had done was to take figures from the World
Factbook as to the percentage of the population speaking a given language, and
then parcel up the GDP (from the World Bank) according to those figures. For
some cases I had to go to other sources, e.g. for India I used
http://www.censusindia.net/cendat/datatable25.html.

Thus they are rough figures, since different language groups will have unequal
distributions of GDP; and there may be significant multilingual populations.
Still, I think it is close enough to get a useful overall picture.
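
The apportioning method described above can be sketched in a few lines
of Python. The country names, GDP figures and language shares below
are invented placeholders, not data from the World Factbook or the
World Bank; only the arithmetic follows the description.

```python
from collections import defaultdict

def gdp_by_language(countries):
    """Split each country's GDP across languages by speaker share."""
    totals = defaultdict(float)
    for gdp, shares in countries.values():
        for language, share in shares.items():
            totals[language] += gdp * share
    return dict(totals)

# Hypothetical input: country -> (GDP, {language: share of population}).
example = {
    "Country A": (1000.0, {"Language X": 0.75, "Language Y": 0.25}),
    "Country B": (400.0, {"Language Y": 1.0}),
}

result = gdp_by_language(example)
print(result)
```

This treats GDP as evenly spread over the population, which is exactly
the approximation the caveats above refer to.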

Mark
__
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

- Original Message - 
From: "John Cowan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tue, 2003 Oct 21 10:20
Subject: Re: GDP by language


> Mark Davis scripsit:
>
> > BTW, some time ago I had generated a pie chart of world GDP divided up by
> > language.
> >
> > Someone on this list asked for a copy, so I posted it here in case others
> > might find it interesting:
>
> Cool!  How do you account for officially bilingual/multilingual countries?
>
> -- 
> Do what you will,                       John Cowan
>    this Life's a Fiction                [EMAIL PROTECTED]
> And is made up of                       http://www.reutershealth.com
>    Contradiction.  --William Blake      http://www.ccil.org/~cowan
>

Re: GDP by language

2003-10-21 Thread Mark Davis

It is PPP. (You get a very different pie chart with other measures of GDP, of
course).

Mark
__
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

- Original Message - 
From: "Patrick Andries" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>
Sent: Tue, 2003 Oct 21 10:07
Subject: Re: GDP by language


>
> - Original Message - 
> From: "Mark Davis" <[EMAIL PROTECTED]>
>
>
> > BTW, some time ago I had generated a pie chart of world GDP divided up by
> > language.
>
> Is this in purchasing power parity or currency parity ?
>
> > Someone on this list asked for a copy, so I posted it here in case others
> > might find it interesting:
> >
> > http://www.macchiato.com/economy/GDP_PPP_by_language.pdf
> >
> > Mark
> > __
>
> P. A.
>
>
>

Re: unicode on Linux

2003-10-21 Thread Andy Heninger

Edward H. Trager wrote:
> On Tuesday 2003.10.21 14:43:43 +0200, Stephane Bortzmeyer wrote:
>> 3) grep: no Unicode regexp
>
> What about ICU's regexp package?
> (http://oss.software.ibm.com/icu/userguide/regexp.html)
> You should be able to use ICU on *any* platform.

ICU does have Unicode regular expressions, but it's a library with an
API, not a grep tool.  ICU does have a simple grep-like sample, but
it's intended only as an illustration of how to use the regexp API, and
lacks nearly all the command line options one would expect in a real
grep replacement.

> and to my knowledge Windows does not yet have grep at all ...

Cygwin!   http://www.cygwin.com/

  -- Andy Heninger
 [EMAIL PROTECTED]

Re: GDP by language

2003-10-21 Thread John Cowan

Mark Davis scripsit:

> BTW, some time ago I had generated a pie chart of world GDP divided up by
> language.
> 
> Someone on this list asked for a copy, so I posted it here in case others might
> find it interesting:

Cool!  How do you account for officially bilingual/multilingual countries?

-- 
Do what you will,                       John Cowan
   this Life's a Fiction                [EMAIL PROTECTED]
And is made up of                       http://www.reutershealth.com
   Contradiction.  --William Blake      http://www.ccil.org/~cowan

Re: Line Separator and Paragraph Separator

2003-10-21 Thread John Burger

> My code also often considers FF to be whitespace, although I've never
> actually (knowingly) encountered it in a real text file.

I have. It shows up in a lot of old text files as a page separator
character.
It's still useful as a way to force lpr (Unix plain-text print utility)
to start a new page.
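
As a small illustration of FF used as a page separator, here is a
Python sketch. The sample text is made up; the only facts relied on
are that FF is U+000C and that Python counts it as whitespace.

```python
FF = "\f"  # FORM FEED, U+000C

# Hypothetical two-page text file with an FF between the pages.
document = "page one, line 1\npage one, line 2\n" + FF + "page two, line 1\n"

pages = document.split(FF)
print(len(pages))
print(FF.isspace())
```

Code that treats FF as whitespace and code that treats it as a page
break can coexist: the first skips it, the second splits on it.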

- John Burger
  MITRE

GDP by language

2003-10-21 Thread Mark Davis

BTW, some time ago I had generated a pie chart of world GDP divided up by
language.

Someone on this list asked for a copy, so I posted it here in case others might
find it interesting:

http://www.macchiato.com/economy/GDP_PPP_by_language.pdf

Mark
__
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

Re: unicode on Linux

2003-10-21 Thread Peter Kirk

On 21/10/2003 08:32, Edward H. Trager wrote:

> ...
>
> and to my knowledge Windows does not yet have grep at all ...

There seem to be various Windows ports of grep available. I have been 
using GNU grep on Windows for many years. Well, technically in a DOS box 
and a Windows 2000 pseudo-DOS box, but then is there a GUI-based grep on 
any platform?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: GDP by language

2003-10-21 Thread Patrick Andries


- Original Message - 
From: "Mark Davis" <[EMAIL PROTECTED]>


> BTW, some time ago I had generated a pie chart of world GDP divided up by
> language.

Is this in purchasing power parity or currency parity ?

> Someone on this list asked for a copy, so I posted it here in case others
> might find it interesting:
>
> http://www.macchiato.com/economy/GDP_PPP_by_language.pdf
>
> Mark
> __

P. A.

Re: unicode on Linux

2003-10-21 Thread Peter Kirk

On 21/10/2003 05:43, Stephane Bortzmeyer wrote:

> ...
>
> See:
>
> http://www.cl.cam.ac.uk/~mgk25/unicode.html

In this page, Markus Kuhn is damaging his credibility by continuing to 
refer in several places to Unicode 3.0, although the page was updated 
some time after the release of Unicode 4.0. Is the rest of this material 
similarly out of date?

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/

Re: Line Separator and Paragraph Separator

2003-10-21 Thread John Cowan

Jill Ramonsky scripsit:

> Well, I can't say I've ever found a use for putting either a C0 or a C1 
> control into a text file, beyond the usual CR, LF and TAB. My code also 
> often considers FF to be whitespace, although I've never actually 
> (knowingly) encountered it in a real text file.

All RFCs contain FFs at the end of each page.

> I would have thought that low codepoints would be highly valuable 
> commodities. Though some may have exotic uses, my experience is that 
> most of them don't seem to be used. 

That's why Microsoft felt free to reassign 80-9F to graphic characters in
its various codepages, which means they cannot reliably be sent across
serial transmission lines, which is what most control characters were
intended for.

> I have tended to treat the control characters rather like the 
> Private Use Area - a space in which I can do what I want so long as
> I don't expect the "outside world" to agree.

Indeed, that's safe enough.  But Unicode is all about interchange, so if
it reassigned any ISO controls, it would step on other uses of them.

> I've even invented (and used) 
> some 8-bit encodings which leave the whole of Latin-1 unchanged (apart 
> from the C1s) and use C1 characters a bit like "surrogate pairs" to 
> reach the rest. 

There is actually a standards-compliant way to achieve "code extension"
of that type, for up to 4 spaces of 94+96 characters each (you are not
allowed to redefine space or DEL):

o   use 0E (shift out) to switch to the 2nd space
o   use 0F (shift in) to return to the main space
o   use 8E (single shift 2) to mark the next byte as being in the 3rd space
o   use 8F (single shift 3) to mark the next byte as being in the 4th space

If you need more than 4 spaces, you then enter the Great Pain, also known
as ISO 2022.
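
The four shift functions listed above can be sketched as a tiny Python
state machine. This is a deliberately simplified illustration, not a
conforming implementation: it assumes the four spaces are already
designated, ignores the GL/GR distinction and escape sequences, and
simply labels every non-shift byte with the space (1 to 4) it would be
read from. The function name tag_spaces is made up for this sketch.

```python
SO, SI, SS2, SS3 = 0x0E, 0x0F, 0x8E, 0x8F

def tag_spaces(data):
    """Return (space, byte) pairs for every non-shift byte in data."""
    locked = 1      # space selected by the locking shifts SO / SI
    single = None   # space selected for the next byte only (SS2 / SS3)
    tagged = []
    for b in data:
        if b == SO:
            locked = 2
        elif b == SI:
            locked = 1
        elif b == SS2:
            single = 3
        elif b == SS3:
            single = 4
        else:
            tagged.append((single or locked, b))
            single = None  # a single shift applies to one byte only
    return tagged

stream = bytes([0x41, SO, 0x42, 0x43, SI, SS2, 0x44, 0x45, SS3, 0x46])
print(tag_spaces(stream))
```

SO and SI stay in effect until the other one is seen, while SS2 and
SS3 affect exactly one following byte, which is the difference between
locking and single shifts.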

-- 
Evolutionary psychology is the theory   John Cowan
that men are nothing but horn-dogs, http://www.ccil.org/~cowan
and that women only want them for their money.  http://www.reutershealth.com
--Susan McCarthy (adapted)  [EMAIL PROTECTED]

RE: Line Separator and Paragraph Separator

2003-10-21 Thread Marco Cimarosti

Jill Ramonsky wrote:
> [...] I've even invented (and used) some 8-bit encodings which
> leave the whole of Latin-1 unchanged (apart from the C1s) and use C1
> characters a bit like "surrogate pairs" to reach the rest.

Doug, are you listening? It seems there's a new clone of UTF:-)Z waiting for
implementation!

_ Marco

Re: [OT] b and d with horizontal bar in Mahorais

2003-10-21 Thread Patrick Andries




 

From: "Patrick Andries" <[EMAIL PROTECTED]>

> In the case of Malagasy writing system (not Mayotte sorry) someone suggested
> to me that it may be phonetically equivalent to U+0253, the sample I have is
> « Kib-osi kimaôre en orthographe malgache »  which means « Malgache de
> Mayotte » (Mayotte being a French island off Madagascar). I will check the
> actual value and usage of this b in Malagasy.

I got some news from the author of this notation. He tells me the b- and d-
(bar across the lower part of the vertical stroke) are glyphic variants of
the implosives « ɓ » U+0253 and « ɗ » U+0257. The forms with bars were
chosen for pragmatic reasons: they can easily be typed « b » and « d », if
necessary.  A few books have been printed using the notation.

(Full message in French on Unicode-Afrique)

Re: unicode on Linux

2003-10-21 Thread Edward H. Trager

On Tuesday 2003.10.21 14:43:43 +0200, Stephane Bortzmeyer wrote:
> On Mon, Oct 20, 2003 at 10:14:22PM +0200,
>  Stefan Persson <[EMAIL PROTECTED]> wrote 
>  a message of 23 lines which said:
> 
> > >Just wondering if anybody knows how unicode is on Linux?
> > >
> > Very good support.
> 
> Very optimistic.
> 
> Kernel
> *
> 
> 1) File names in Unicode: no (well, the Linux kernel is 8-bit clean
> so you can always encode in UTF-8, but the kernel does not do any
> normalization and the applications do not expect UTF-8, for instance
> ls sorts alphabetically but does not know Unicode sorting).
> 

I think there can be big debates about whether the Linux kernel (or any
*nix kernel, for that matter) has any business normalizing file names.
Personally I think Unicode normalization is not the kernel's business.
This is better left to the userland applications.
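
To show what is at stake in the normalization question, here is a
minimal Python sketch using the standard unicodedata module. The name
"café" is just a sample; the point is that one visible name has two
different UTF-8 byte sequences, and a kernel that treats file names as
opaque bytes sees two different names.

```python
import unicodedata

name = "caf\u00e9"  # sample name with a precomposed e-acute

nfc = unicodedata.normalize("NFC", name)  # precomposed form
nfd = unicodedata.normalize("NFD", name)  # base letter + combining accent

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes, nfd_bytes, nfc_bytes == nfd_bytes)
```

A userland application that wants the two to compare equal can
normalize both sides before comparing, which is the division of labour
argued for above.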

Are you sure about ls?  ls should sort UTF-8-encoded file names in raw
Unicode order, n'est-ce pas?  Of course, that may not be what one wants!
Take Chinese for example: there are many different methods for sorting
Chinese used in Chinese dictionaries (phonetic, radical+stroke count, four
corner method, ... ).  The order of the unified Hanzi/Kanji in Unicode used
the KangXi (radical and stroke-count based) dictionary as a primary basis,
and the Dai Kanwa Ziten as a secondary basis.  So the result is a hybrid
Chinese plus Japanese ordering.  Plus, the CJK Joint Research Group had to
deal with the placement of all of the simplified Chinese characters that
were not listed in the historical KangXi dictionary (originally compiled
between 1710 and 1716). It is nice that Unicode in some sense preserves the
great tradition first established by the KangXi ZiDian, but that sort order
may not be what any one modern native Chinese, Japanese, or other user needs
or wants for his particular purpose.  Similar stories exist for other
scripts and languages.
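
What "raw Unicode order" looks like can be shown with a short Python
sketch. The file names are invented, and this only demonstrates
code-point ordering; whether a particular ls actually sorts this way
depends on the locale it runs under, which is not tested here.

```python
# Hypothetical file names.
names = ["zebra", "Zulu", "\u00e9clair", "apple"]

# Sorting Python strings compares code points.
by_code_point = sorted(names)

# Sorting the UTF-8 encodings byte by byte gives the same order,
# because UTF-8 preserves code-point order.
by_utf8_bytes = sorted(names, key=lambda s: s.encode("utf-8"))

print(by_code_point)
```

Upper case sorts before lower case and the accented name lands last,
which is rarely what a dictionary would do. Language-aware ordering
needs a collation layer on top.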

> 2) User names: worse since utilities to create an account refuse
> UTF-8.
> 
> Applications
> 
> 
> 3) grep: no Unicode regexp

What about ICU's regexp package? 
(http://oss.software.ibm.com/icu/userguide/regexp.html)
You should be able to use ICU on *any* platform. 
Linux does not yet have a Unicode grep,
and to my knowledge Windows does not yet have grep at all ...

Most of my pattern searching and string manipulation needs
-- which includes searching through documents and data encoded in UTF-8 --
are fully met using egrep and Perl (I happen to use Linux, but of course
Perl is available on every platform).  So it is clear that everything
depends on evaluating one's needs, and then figuring out which software
will meet those needs. There is now enough Unicode-aware software on Linux
to meet many people's needs. See http://eyegene.ophthy.med.umich.edu/unicode/.
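
As an illustration of the difference between a byte-oriented and a
character-oriented regexp on UTF-8 data, here is a minimal Python
sketch. Python's re module is used purely as an example engine (the
message above mentions egrep and Perl), and grep_text is a made-up
helper name.

```python
import re

def grep_text(pattern, data, encoding="utf-8"):
    """Decode data (bytes) and return the lines that match pattern."""
    regex = re.compile(pattern)
    text = data.decode(encoding)
    return [line for line in text.splitlines() if regex.search(line)]

sample = "na\u00efve caf\u00e9\nplain ascii\n".encode("utf-8")

# On decoded text, "." matches the whole character e-acute.
char_hits = grep_text(r"caf.$", sample)

# On raw bytes, "." matches one byte, so the two-byte e-acute breaks the match.
byte_hit = re.search(rb"caf.$", "caf\u00e9".encode("utf-8"))

print(char_hits, byte_hit)
```

Decoding first and matching on characters is what makes the pattern
behave as a reader would expect.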

> 
> 4) xterm (or similar virtual terminals): No BiDi support at all

Use mlterm instead.  It has BiDi support and support for complex text
layout as required for Arabic, Indic, and Indic-derived scripts.  See
http://eyegene.ophthy.med.umich.edu/unicode/#termemulator .

> 5) shells: I'm not aware of any line-editing shell (zsh, tcsh)
> that has Unicode character semantics (back-character should move one
> character, not one byte)
> 
> 6) databases: I'm not aware of a free DBMS which has support for
> Unicode sorting (SQL's ORDER BY) or regexps (SQL's LIKE).
> 

I thought both Postgres and MySQL either already have this or are
working on it?
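
On point 5 above (back-character should move one character, not one
byte), the core of the fix for a UTF-8 line buffer can be sketched in
Python. This is an assumption-laden illustration, not how any
particular shell does it: previous_char_start is a made-up name, and
the sketch handles code points only, not combining sequences.

```python
def previous_char_start(buf, pos):
    """Return the byte index where the character before pos starts."""
    pos -= 1
    # UTF-8 continuation bytes look like 10xxxxxx; skip back over them.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return max(pos, 0)

line = "a\u20acb".encode("utf-8")  # 'a', EURO SIGN (3 bytes), 'b'

print(previous_char_start(line, 5))  # cursor at end of line
print(previous_char_start(line, 4))  # cursor just after the euro sign
```

Moving back from just after the euro sign skips all three of its
bytes in one step.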

> 7) Serious word processing: LaTeX has only very minimum Unicode

Many would argue that Open Office 1.1 needs to be included in the 
category of "serious word processing" and it has good 
Unicode support.

> 
> Also, many applications (exmh, emacs) are ten times slower when
> running in UTF-8 mode.
>

exmh is written in Tcl/Tk: isn't everything written in Tcl/Tk sssllowww?
When was the last time that it really mattered how fast your
editor worked?  If emacs is slow, use vi ;-)  ...  oops, I forgot
this might provoke some people (it's a joke)!

> At the present time, using Unicode on Unix is an act of faith.

That is not an accurate statement.

Are you talking about proprietary Unixes or Linux?  I thought the
questions were about support on Linux. With regard to Unicode support on
Linux, I completely disagree with you.  I use Unicode for serious 
work on Linux every day.

Clearly it really depends on what you want to do.  And that is the case
on other OSes as well.  

> 
> >  Default charset for recent versions of some popular distributions.
> 
> Yes, RedHat changed the default charset to Unicode without thinking
> that text files were no longer readable. 
> 
> See:
> 
> http://www.cl.cam.ac.uk/~mgk25/unicode.html
> ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
> http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html
>

RE: Line Separator and Paragraph Separator

2003-10-21 Thread Elliotte Rusty Harold

At 4:12 PM +0100 10/21/03, Jill Ramonsky wrote:
> Hmm.
>
> Well, I can't say I've ever found a use for putting either a C0 or a
> C1 control into a text file, beyond the usual CR, LF and TAB. My
> code also often considers FF to be whitespace, although I've never
> actually (knowingly) encountered it in a real text file.

I have. It shows up in a lot of old text files as a page separator
character. It's also occasionally used as a document separator when
someone wants to stuff multiple XML documents in the same file.

--

  Elliotte Rusty Harold
  [EMAIL PROTECTED]
  Processing XML with Java (Addison-Wesley, 2002)
  http://www.cafeconleche.org/books/xmljava
  http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA

RE: Line Separator and Paragraph Separator

2003-10-21 Thread Jill Ramonsky

Hmm.

Well, I can't say I've ever found a use for putting either a C0 or a C1 
control into a text file, beyond the usual CR, LF and TAB. My code also 
often considers FF to be whitespace, although I've never actually 
(knowingly) encountered it in a real text file.

I would have thought that low codepoints would be highly valuable 
commodities. Though some may have exotic uses, my experience is that 
most of them don't seem to be used. In the past (that is, in the 
pre-Unicode days, or when specifically working with ASCII or Latin-1 
strings), I have tended to treat the control characters rather like the 
Private Use Area - a space in which I can do what I want so long as I
don't expect the "outside world" to agree. I've even invented (and used)
some 8-bit encodings which leave the whole of Latin-1 unchanged (apart 
from the C1s) and use C1 characters a bit like "surrogate pairs" to 
reach the rest. (I didn't expect this to catch on, it was for internal 
use only).

I'm really surprised that Unicode "didn't want to go there".
Still, that's life.
Jill
> -----Original Message-----
> From: John Cowan [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 21, 2003 1:58 PM
> To: Jill Ramonsky
> Cc: [EMAIL PROTECTED]
> Subject: Re: Line Separator and Paragraph Separator
>
>
> Jill Ramonsky scripsit:
>
> > I wonder why it was not felt a good idea at the time (the early 1990s)
> > to have defined LS and PS, but with codepoints somewhere in the range
> > U+00 to U+1F.
>
> Pretty much because other ISO standards specify the meaning of that set,
> and Unicode/ISO 10646 very much didn't want to go there.  I say "meaning",
> but there are actually multiple possible meanings, though most of them are
> fairly consistent.

Re: unicode on Linux

2003-10-21 Thread Edward H. Trager

On Monday 2003.10.20 13:31:49 -0700, Shao, Yiying wrote:
> Thanks for your info.  
> 
> >>Just wondering if anybody knows how unicode is on Linux?
> >>
> >Very good support.  Default charset for recent versions of some popular
> >distributions.
> 
> What are those popular distributions and which version?  
> 
> 
> >>On Red Hat Linux, if UTF-8 is not made the default encoding for
> >>Chinese/Japanese/Korean, what is it using for those double-byte languages?
>
> >The old multi-byte character sets.
>
> So, how should I implement my code?  Do I have to say, if this is Japanese
> (for example), convert the Unicode (UTF-8) to multi-byte characters?  That
> seems very painful.
>  
No.  Forget about old multi-byte encodings.  Just set your locale to a UTF-8
locale and use UTF-8 for all languages.  In my experience (on SuSE 7.3, 8.1,
8.2, and the 9.0 betas) all of the "important" applications handle CJK
languages perfectly well under a UTF-8 locale.  The "important" applications
for me are things like Open Office 1.1, Konsole, vim, MySQL, and Mozilla.
For CJK input, use SCIM (http://ns.turbolinux.com.cn/~suzhe/scim/index.html).
For many other details about Unicode on Linux, see my page at
http://eyegene.ophthy.med.umich.edu/unicode/index.html.
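As an illustration of "set your locale to a UTF-8 locale", here is a minimal Python sketch that checks whether a POSIX-style locale name (the value of LANG, for example) names a UTF-8 codeset. The helper names `locale_codeset` and `is_utf8_locale` are made up for this example, and the parsing assumes the common `language_TERRITORY.codeset@modifier` layout; it is not a substitute for asking the C library.

```python
import os

def locale_codeset(locale_name):
    """Return the codeset part of a POSIX-style locale name, or None."""
    # Drop an optional @modifier, then take whatever follows the first dot.
    name = locale_name.split("@", 1)[0]
    if "." not in name:
        return None
    return name.split(".", 1)[1]

def is_utf8_locale(locale_name):
    """True if the locale name asks for UTF-8 (spelled UTF-8 or utf8)."""
    codeset = locale_codeset(locale_name)
    return codeset is not None and codeset.replace("-", "").lower() == "utf8"

# Report on the current environment; an unset LANG counts as non-UTF-8.
print(is_utf8_locale(os.environ.get("LANG", "")))
```

With this sketch, a name such as zh_CN.UTF-8 counts as a UTF-8 locale, while a legacy name such as ja_JP.eucJP or the bare "C" locale does not.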

> >>Does later Red Hat Linux make UTF-8 the default encoding for them?
>
> >AFAIK only if you manually set it to a UTF-8 locale, e.g.
> >LANG=zh_CN.UTF-8.  Notice, though, that some older software will not be
> >aware of this change, so many characters will not be displayed properly.
>
> So, is this setting available from Red Hat 8.0 or later?  Also, do you mean
> some old versions of Linux may not be aware of this setting?
>
>
> Besides, do you happen to know ICU from IBM?  Does it take care of the
> Unicode problems with double-byte languages on Linux?

Most likely.  But I think your life will be easier if you just use UTF-8 for
all languages and forget about legacy encodings.  I'm sure ICU must have very
robust UTF-8 support.

> 
> Thanks,
> Yiying

RE: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-21 Thread Kent Karlsson

John Cowan wrote:
> In addition, the Rationale makes clear that internal newlines can be
> mapped to anything appropriate on output, including CR/LF and padding
> with blank spaces to fit into a card reader/punch environment:

Only for "text mode IO". Not for binary mode IO. If you want to portably(*)
deal with Unicode IO, esp. UTF-16 and UTF-32 (either BE or LE), but also
UTF-8, use binary mode IO.

/kent k

---
(*) Portable here means: working the same way for the same input on each
platform, not "platform dependent".
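The difference between text mode and binary mode can be sketched in Python rather than C. The example below is only an illustration: `write_both_ways` is a made-up helper, and `newline="\r\n"` is passed explicitly to imitate the translation a CRLF platform would apply in text mode, so the result does not depend on where it runs.

```python
import os
import tempfile

def write_both_ways(text):
    """Write `text` once through a text-mode stream and once as raw bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        text_path = os.path.join(tmp, "text_mode.txt")
        bin_path = os.path.join(tmp, "binary_mode.txt")

        # Text mode: every "\n" written is translated to the given newline.
        with open(text_path, "w", encoding="utf-8", newline="\r\n") as f:
            f.write(text)

        # Binary mode: the program encodes the text itself; nothing is translated.
        with open(bin_path, "wb") as f:
            f.write(text.encode("utf-8"))

        # Read both files back as raw bytes to see what actually hit the disk.
        with open(text_path, "rb") as f:
            via_text = f.read()
        with open(bin_path, "rb") as f:
            via_binary = f.read()
    return via_text, via_binary

via_text, via_binary = write_both_ways("line one\nline two\n")
print(via_text)
print(via_binary)
```

The text-mode file ends up with CR LF pairs, while the binary-mode file contains exactly the bytes the program produced, which is the property that makes binary mode the portable choice here.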

RE: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-21 Thread Winkler, Arnold F

Jill,

The standard is available at
http://www.techstreet.com/cgi-bin/detail?product_id=232462

It is a bargain, the PDF file goes for $18.00 (yes, eighteen USD).  The
printed version is somewhat more expensive, $220.00.

Go order it, and your desire for a reference will be satisfied.

Regards
Arnold

==

-----Original Message-----
From: Jill Ramonsky [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 21, 2003 8:31 AM
To: [EMAIL PROTECTED]
Subject: RE: Backslash n [OT] was Line Separator and Paragraph Separator



I am very happy to be corrected.
Thank you very much.

I would also greatly appreciate the "chapter and verse" ... not because 
I want to carry on arguing (I don't), but simply because I would very 
much like to have that standard available to me as a reference work.

Thanks again, and my apologies John,
Jill


 > -----Original Message-----
 > From: John Cowan [mailto:[EMAIL PROTECTED]
 > Sent: Tuesday, October 21, 2003 1:19 PM
 > To: Jill Ramonsky
 > Cc: [EMAIL PROTECTED]
 > Subject: Re: Backslash n [OT] was Line Separator and
 > Paragraph Separator
 >
 >
 > Jill Ramonsky scripsit:
 >
 > > This is axiomatically *THE* definition. Period. Everything else is
 > > merely quoting, rephrasing or reinterpreting this original.
 >
 > Absolutely not.  The *standard* for the C programming language is now
 > ISO/IEC 9899.


 > Anyone have the standard handy to quote chapter and verse?
 >

Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-21 Thread Jonathan Coxhead

   On 21 Oct 2003, at 12:01, Jill Ramonsky wrote:

> I would be more than grateful if someone could point me in the direction
> of a DEFINITIVE specification which claims this is not the case, that the
> interpretation of "\n" as anything other than LF may be considered
> conformant behaviour.

   The C standard is the relevant place ...

5.2.2 Character display semantics

1 [...]

2 Alphabetic escape sequences representing nongraphic characters in the 
execution character set are intended to produce actions on display devices as 
follows:

\a (alert) Produces an audible or visible alert. The active position shall not 
be changed.

\b (backspace) Moves the active position to the previous position on the 
current line. If the active position is at the initial position of a line, the 
behavior is unspecified.

\f (form feed) Moves the active position to the initial position at the start 
of the next logical page.

\n (new line) Moves the active position to the initial position of the next 
line.

\r (carriage return) Moves the active position to the initial position of the 
current line.

\t (horizontal tab) Moves the active position to the next horizontal tabulation
position on the current line. If the active position is at or past the last defined
horizontal tabulation position, the behavior is unspecified.

\v (vertical tab) Moves the active position to the initial position of the next 
vertical tabulation position. If the active position is at or past the last 
defined vertical tabulation position, the behavior is unspecified.

3 Each of these escape sequences shall produce a unique implementation-defined 
value which can be stored in a single char object. *** The external 
representations in a text file need not be identical to the internal 
representations, and are outside the scope of this International Standard. ***

   (My emphasis ... JRC)

   The execution character set is itself also not specified ... it need not be 
ASCII.

   Hope this helps ...

/|
 o o o (_|/
/|
   (_/

Re: Line Separator and Paragraph Separator

2003-10-21 Thread John Cowan

Jill Ramonsky scripsit:

> I wonder why it was not felt a good idea at the time (the early 1990s) 
> to have defined LS and PS, but with codepoints somewhere in the range 
> U+00 to U+1F. 

Pretty much because other ISO standards specify the meaning of that set,
and Unicode/ISO 10646 very much didn't want to go there.  I say "meaning",
but there are actually multiple possible meanings, though most of them are
fairly consistent.

> I'm not surprised that NEL never caught on though.

Note that the presence of NEL in the C1 area (U+0080 to U+009F) reflects
an earlier attempt to do the same thing that generated LS.  Some ISO
committee recognized that LF was being overloaded to mean "move to the
next line" and "go back to the beginning, then move to the next line"
and introduced the characters U+0084 (IND) and U+0085 (NEL) to
disambiguate, presumably in hopes that LF would eventually be abandoned
in favor of IND and NEL as appropriate.

No such luck, Doc.  
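To make the collection of competing line terminators concrete, here is a small Python sketch. It relies on the documented behaviour of `str.splitlines()`, which treats LF, CR, CR LF, NEL (U+0085), LS (U+2028) and PS (U+2029), among others, as line boundaries; the sample string is invented for the example.

```python
# One string using six different conventions for ending a line.
text = "one\ntwo\r\nthree\rfour\u0085five\u2028six\u2029seven"

# splitlines() recognises all of them, including NEL, LS and PS.
all_boundaries = text.splitlines()

# Splitting on LF alone leaves the other terminators embedded in the data.
lf_only = text.split("\n")

print(all_boundaries)
print(lf_only)
```

Code that splits on LF alone sees three pieces here, while a splitter that knows about every convention sees seven lines.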

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
I am he that buries his friends alive and drowns them and draws them
alive again from the water. I came from the end of a bag, but no bag
went over me.  I am the friend of bears and the guest of eagles. I am
Ringwinner and Luckwearer; and I am Barrel-rider.  --Bilbo to Smaug

Re: unicode on Linux

2003-10-21 Thread Stephane Bortzmeyer

On Mon, Oct 20, 2003 at 10:14:22PM +0200,
 Stefan Persson <[EMAIL PROTECTED]> wrote 
 a message of 23 lines which said:

> >Just wondering if anybody knows how unicode is on Linux?
> >
> Very good support.

Very optimistic.

Kernel
******

1) File names in Unicode: no (well, the Linux kernel is 8-bit clean
so you can always encode in UTF-8, but the kernel does not do any
normalization and the applications do not expect UTF-8; for instance,
ls sorts alphabetically but does not know Unicode sorting).

2) User names: worse, since the utilities that create an account refuse
UTF-8.

Applications
************

3) grep: no Unicode regexp

4) xterm (or similar virtual terminals): No BiDi support at all

5) shells: I'm not aware of any line-editing shell (zsh, tcsh)
that has Unicode character semantics (back-character should move one
character, not one byte)

6) databases: I'm not aware of a free DBMS which has support for
Unicode sorting (SQL's ORDER BY) or regexps (SQL's LIKE).

7) Serious word processing: LaTeX has only very minimal Unicode support
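Item 5 above (back-character should move one character, not one byte) can be sketched in a few lines of Python. This is an illustration only: `previous_char_start` is a made-up helper, it assumes the buffer holds valid UTF-8, and it steps by code point rather than by grapheme cluster, so combining sequences would need more work.

```python
def previous_char_start(buf, pos):
    """Return the byte offset where the character ending just before `pos` starts."""
    if pos <= 0:
        return 0
    pos -= 1
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx; skip back over them.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos

line = "naïve 日本".encode("utf-8")    # 13 bytes for 8 characters
cursor = len(line)                     # cursor sits at the end of the line
cursor = previous_char_start(line, cursor)
print(cursor)                          # byte offset where the last character begins
```

A byte-oriented editor would move the cursor from 13 to 12 and land in the middle of a three-byte character; the sketch moves it to the start of that character instead.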

Also, many applications (exmh, emacs) are ten times slower when
running in UTF-8 mode.

At the present time, using Unicode on Unix is an act of faith.

>  Default charset for recent versions of some popular distributions.

Yes, RedHat changed the default charset to Unicode without thinking
that text files were no longer readable. 

See:

http://www.cl.cam.ac.uk/~mgk25/unicode.html
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html

RE: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-21 Thread Jill Ramonsky

I am very happy to be corrected.
Thank you very much.
I would also greatly appreciate the "chapter and verse" ... not because 
I want to carry on arguing (I don't), but simply because I would very 
much like to have that standard available to me as a reference work.

Thanks again, and my apologies John,
Jill
> -----Original Message-----
> From: John Cowan [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, October 21, 2003 1:19 PM
> To: Jill Ramonsky
> Cc: [EMAIL PROTECTED]
> Subject: Re: Backslash n [OT] was Line Separator and
> Paragraph Separator
>
>
> Jill Ramonsky scripsit:
>
> > This is axiomatically *THE* definition. Period. Everything else is
> > merely quoting, rephrasing or reinterpreting this original.
>
> Absolutely not.  The *standard* for the C programming language is now
> ISO/IEC 9899.
> Anyone have the standard handy to quote chapter and verse?
>

Re: Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-21 Thread John Cowan

Jill Ramonsky scripsit:

> This is axiomatically *THE* definition. Period. Everything else is
> merely quoting, rephrasing or reinterpreting this original.

Absolutely not.  The *standard* for the C programming language is now
ISO/IEC 9899.  The 2nd edition of K & R, much-beloved as it is, is just
two guys' interpretation of that standard, as the book itself makes clear.
What they say possesses a peculiar interest, but not a peculiar authority.

The standard itself is not on line, but the Rationale, which was
written by the same working group at the same time, is on line at
std.dkuug.dk/JTC1/SC22/WG14/www/docs/n850.ps .  It makes quite clear that
*any* character set that contains the necessary characters is appropriate
for C:

# There was strong sentiment that C should not be tied to ASCII, despite
# its heritage and despite the precedent of Ada being defined in terms
# of ASCII.  Rather, an implementation is required to provide a unique
# character code for each of the printable graphics used by C, and for
# each of the 40 control codes representable by an escape sequence.  [...]
# Translation and execution environments may have different character sets,
# but each must meet this requirement in its own way.  

In addition, the Rationale makes clear that internal newlines can be
mapped to anything appropriate on output, including CR/LF and padding
with blank spaces to fit into a card reader/punch environment:

# In the UNIX model, division of a file into lines is effected by newline
# characters. Different techniques are used by other systems: lines may
# be separated by CR-LF (carriage return, line feed) or by unrecorded
# areas on the recording medium; or each line may be prefixed by its
# length.  The Standard addresses this diversity by specifying that newline
# be used as a line separator at the program level, but then permitting an
# implementation to transform the data read or written to conform to the
# conventions of the environment.  Some environments represent text lines as
# blank-filled fixed-length records.  Thus the Standard specifies that it is
# implementation-defined whether trailing blanks are removed from a line on
# input.  (This specification also addresses the problems of environments
# which represent text as variable-length records, but do not allow a
# record length of 0: an empty line may be written as a one-character
# record containing a blank, and the blank is stripped on input.)

Anyone have the standard handy to quote chapter and verse?

-- 
Híggledy-pìggledy / XML programmers         John Cowan
Try to escape those / I-eighteen-N woes;    http://www.ccil.org/~cowan
Incontrovertibly / What we need more of is  http://www.reutershealth.com
Unicode weenies and / François Yergeaus.    [EMAIL PROTECTED]

RE: Line Separator and Paragraph Separator

2003-10-21 Thread Jill Ramonsky

Interesting.

I do strongly suspect, however, that at least part of the reason that LS 
and PS didn't take off was that they are more than seven bits wide, and 
hence cannot be transported in plain ASCII text.
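The "more than seven bits wide" point can be checked with a short Python sketch. The helper name `fits_in_seven_bits` is invented for the example; the code points are the standard ones for LF, NEL, LS and PS.

```python
LF = "\u000a"   # LINE FEED (C0 control)
NEL = "\u0085"  # NEXT LINE (C1 control)
LS = "\u2028"   # LINE SEPARATOR
PS = "\u2029"   # PARAGRAPH SEPARATOR

def fits_in_seven_bits(ch):
    """True if the character can be carried unchanged in plain ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# Show, for each character, whether it is ASCII and what its UTF-8 bytes are.
for name, ch in [("LF", LF), ("NEL", NEL), ("LS", LS), ("PS", PS)]:
    print(name, fits_in_seven_bits(ch), ch.encode("utf-8").hex())
```

Only LF survives a 7-bit channel as a single byte; NEL needs two bytes in UTF-8, and LS and PS need three each.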

I wonder why it was not felt a good idea at the time (the early 1990s) 
to have defined LS and PS, but with codepoints somewhere in the range 
U+00 to U+1F. I think it would have been fairly easy to find some mostly 
unused ones, for example U+10 and U+11. The reason? SMTP traffic is (by 
definition) transmitted across 7-bit-wide channels. HTTP traffic is 
transmitted across 8-bit wide channels. In the internet world, "newline" 
is CRLF, and everything else has to be converted to it for transmission 
across the internet.

Personally, I would have added a THIRD kind of separator, a "soft line 
break". The reason? Some email relays insist on a "maximum line length" 
of emails. In these days of mime types and attachments, we inject CRLF 
into the files to keep such relays happy, but the renderer ignores them 
as "just whitespace". If we had had a "soft line break" character (in
the range U+00 to U+1F), we could have retrofitted it into existing
email protocols. Had we done this, SLB could have been considered "just
whitespace", while LS and PS would have been not-ignorable in HTML (and
in fact, equivalent to <br> and <p> respectively).

I'm not surprised that NEL never caught on though.

Jill



> -----Original Message-----
> From: Frank da Cruz [mailto:[EMAIL PROTECTED]
> Sent: Monday, October 20, 2003 4:53 PM
> To: Jill Ramonsky
> Cc: [EMAIL PROTECTED]
> Subject: Re: Line Separator and Paragraph Separator
>
>
> At some point in the early 1990s, the thinking was that ASCII control
> characters were included in Unicode only for round-trip compatibility
> with existing character sets, but their semantics were undefined, and
> anyway they were not needed since they were from the bygone days of
> terminals and similar antique contraptions, whereas in modern times all
> text is "flowed" by "smart rendering engines".
>
> Ten years hence, the terminal-to-host model is still widely used, as is
> text with hard line breaks, but to convince the skeptics and
> ultra-modernists that line breaks were still a useful concept, I
> mentioned line-oriented programming languages (such as Fortran), and
> poetry.  Hence the line separator.
>
> Later everybody realized you couldn't stamp out ASCII control characters,
> so we're still using them; LS and PS never caught on as far as I know.
> Although obviously, LS would have been an improvement over the existing
> situation, in which different line separators (CR, LF, CRLF) are used
> on different platforms, which would otherwise have compatible text
> record formats, which to this day causes no end of confusion.
>
> At some point after Unicode 2.0, the C1 controls were adopted from
> ISO 6429, in which we have a Next Line control (NEL, U+0085), which
> might also have served the purpose, but it never caught on either.
>
> - Frank
>

Backslash n [OT] was Line Separator and Paragraph Separator

2003-10-21 Thread Jill Ramonsky

Call me pedantic, but

From "The C Programming Language", Second Edition, by Brian W.
Kernighan and Dennis M. Ritchie (the architects of C), page 193, which
explicitly lists all allowable escape sequences. This defines them as
follows:

    newline           NL (LF)   \n
    horizontal tab    HT        \t
    vertical tab      VT        \v
    backspace         BS        \b
    carriage return   CR        \r
    formfeed          FF        \f
    audible alert     BEL       \a
    backslash         \         \\
    question mark     ?         \?
    single quote      '         \'
    double quote      "         \"
    octal number      ooo       \ooo
    hex number        hh        \xhh

The second column is clearly a list of ASCII characters, and the first
seven items are clearly a list of ASCII character names. The mapping
for \n is explicitly "NL (LF)". My interpretation of this is that this
can only refer to the ASCII control character LF, since that is what is
explicitly named. (Yes, it's in brackets, but the accompanying text
supplies no additional meaning to or interpretation of those brackets).
This is axiomatically *THE* definition. Period. Everything else is
merely quoting, rephrasing or reinterpreting this original.

I would be more than grateful if someone could point me in the
direction of a DEFINITIVE specification which claims this is not the
case, that the interpretation of "\n" as anything other than LF may be
considered conformant behaviour. (Usage does not define conformance to a
standard). I will then be happy to modify my claim.

Of course, I'm not suggesting that violating this spec is wrong, merely
that violating this spec is violating this spec.
Jill


> -----Original Message-----
> From: John Cowan [mailto:[EMAIL PROTECTED]]
> Sent: Monday, October 20, 2003 6:32 PM
> To: Jill Ramonsky
> Cc: [EMAIL PROTECTED]
> Subject: Re: Line Separator and Paragraph Separator
> 
> > Strictly speaking, BY DEFINITION (from the C and C++
> > specs), "\n" is supposed to mean LF, and nothing else,
>
> It means any one character that serves a new-linish function, which can
> be LF or CR or NEL, for example.  On EBCDIC-based systems, the native
> C compiler interprets \n as 0x25, which is NEL.

Re: PUA

2003-10-21 Thread Philippe Verdy

Marco Cimarosti <[EMAIL PROTECTED]> writes:
> Now, my PuaInterpretation variable contains the following information:
>
> Foobar.ttf
>
> And my string contains the following text:
>
> 
> (U+E017 U+E009)
>
> Now, what's the next step? What am I supposed to do to find out whether,
> according to the PUA interpretation called "Foobar.ttf", U+E017 and U+E009
> are letters or not?

Indeed, I don't like the idea of tagging PUA text with "font name
tags".

I'd prefer tagging the PUA text with "script name tags" (I mean
the extended user-defined script codes like "x-klingon", followed by
a base codepoint indicator and a codespace length, like
"x-klingon;b=E000;l=80"):

- this gives a real interpretation to PUAs, evaluated in their context,

- it allows remapping them locally to other ranges in case of conflict
between multiple PUA conventions

- the script indicator name can be mapped locally to a character properties
database, indexed at the relative codepoint in the PUA convention codespace.

- any number of fonts can be designed to work with PUAs even if they are
sharing conflicting codespaces.

- any language can use this system.

- no more need for extra planes

- experimentation with new scripts still not standardized is possible,
including for character properties, breaking behavior, layout, grapheme
clustering, ...

- emulation of new standardized scripts becomes possible on previous
implementations that lack support for new characters or scripts...
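The tagging scheme above could be prototyped roughly as follows. This Python sketch is hypothetical throughout: the function names are invented, and it assumes that both the `b` (base) and `l` (length) fields are hexadecimal, which the message suggests for `b=E000` but does not state for `l=80`.

```python
def parse_pua_tag(tag):
    """Split a tag such as 'x-klingon;b=E000;l=80' into (script, base, length)."""
    script, *params = tag.split(";")
    fields = dict(p.split("=", 1) for p in params)
    # Assumption: both numbers are written in hexadecimal.
    return script, int(fields["b"], 16), int(fields["l"], 16)

def relative_index(tag, ch):
    """Return the offset of `ch` inside the tagged PUA range, or None if outside."""
    _, base, length = parse_pua_tag(tag)
    offset = ord(ch) - base
    return offset if 0 <= offset < length else None

tag = "x-klingon;b=E000;l=80"
print(parse_pua_tag(tag))
print(relative_index(tag, "\ue017"))
```

The relative index is what a locally installed properties table for the named script would be keyed on, so the same table keeps working if the range is remapped to a different base.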

Re: A certain committee?

2003-10-21 Thread Doug Ewell

Jill Ramonsky wrote:

> Who were this "certain committee"? And why did they have so much
> control over the Unicode Consortium that they could force the
> introduction of a new character block that nobody had ever previously
> used? What was this "abuse of UTF-8" of which you speak. Indeed, what
> is an "abuse" of UTF-8? What does the phrase even mean?

The so-called "Multi-Lingual String Format" was described in an
Internet-Draft, draft-ietf-acap-mlsf-01.txt, written by Chris Newman of
Innosoft in June 1997.  It was an attempt to define a lightweight,
inline language tagging protocol for ACAP (Application Configuration
Access Protocol) using invalid UTF-8 sequences, such as  for
"en".

The protocol was described as "another layer of encoding on top of
UTF-8," but since there was no signature mechanism or other way for
UTF-8 processors to tell this MLSF from normal (corrupted) UTF-8 text,
it was effectively a non-standard extension of UTF-8.

At the time this was proposed, UTF-8 was still new and not very widely
adopted, and there was apparently great concern within the UTC that this
non-standard extension would undermine the stability of the UTF-8 format
(just as the tacit approval of non-shortest UTF-8 sequences was
criticized as a security hole years later).  Plane 14 tags were
introduced as an equally lightweight countermeasure to persuade the ACAP
people to abandon MLSF in favor of an official tagging mechanism that
used real (but out-of-the-way) Unicode characters and did not break the
rules of UTF-8.
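What "breaking the rules of UTF-8" means in practice can be shown with a strict decoder. In this Python sketch the byte strings are illustrative only and are not the sequences MLSF actually used; `is_valid_utf8` is a made-up helper around the standard `bytes.decode`.

```python
def is_valid_utf8(data):
    """True if a strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

samples = {
    "plain ASCII": b"en",
    "two-byte character": "é".encode("utf-8"),
    "non-shortest form of '/'": b"\xc0\xaf",
    "stray continuation byte": b"\x80",
}

# Report which samples a conforming decoder would accept.
for label, data in samples.items():
    print(label, is_valid_utf8(data))
```

A conforming decoder rejects both the non-shortest form and the stray continuation byte, so any protocol that assigns meaning to such bytes is no longer interchangeable with ordinary UTF-8 processors.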

> How can you possibly add a block of characters to Unicode and then say
> "the UTC sincerely hopes that they never get used at all"?
> (Particularly when there are still people around whose actual real
> characters are still not being added).

First, the comparison between adding this special-purpose tagging
mechanism and adding "actual real characters" that are part of some
writing system is disingenuous.  Nobody ever made a choice between
encoding Tai Lue, Rejang, or Plane 14 tags.

Second, there are those of us (outside the UTC) who do feel that Plane
14 language tags have a valid use, since not all text that may benefit
from language tagging is necessarily in a marked-up format.  But the
writing is on the wall, and "those of us" have given up our battle.

> If this "certain committee" had intended to (falsely) declare
> something as UTF-8 and then embed something like:
>
> lang=en-uk
>
> where  and  are invalid UTF-8 byte-sequences, then so what?
> That would simply mean that "a certain committee"'s code wouldn't then
> interoperate with the rest of the world. Why is that any business of
> the UC's?

Because they were publishing their mechanism as an Internet-Draft, which
would soon have graduated to being an RFC, and then other groups might
have picked it up.  Again, if you think back to 1997, the most commonly
referenced definition of UTF-8 itself was an RFC.

> Hell, if only the KLI had thought to implement the Klingon alphabet in
> invalid UTF-8 sequences - then maybe the UC would have added Klingon
> characters just to shut them up, saying things like "it's not really a
> script", and "the UTC sincerely hopes that they never get used at
> all". Could have saved an awful lot of time!

With all respect, this completely misrepresents the intent and working
process of the UTC.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
