from:"John Cowan"

Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan

Mike Ayers scripsit:

>   I thought that URLs were specified to be in Unicode.  Am I mistaken?

You are.  URLs are specified to be in *ASCII*.  There is a %-encoding
hack that allows you to represent random-octet filenames as ASCII.
Some people (including me) think it's a good idea to use this hack
to specify non-ASCII characters with double encoding (first as UTF-8,
then with the %-hack), but the URI Syntax RFC doesn't say.
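
The double encoding described above (UTF-8 first, then the %-hack) can be sketched in Python. This is an illustration only: the sample filename and octets are made up, and urllib.parse.quote is simply one library routine that follows this convention when asked to use UTF-8.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Hypothetical non-ASCII filename, written with escapes: "naive cafe.txt"
# with i-diaeresis and e-acute.
name = "na\u00efve caf\u00e9.txt"

# quote() first encodes the text as UTF-8, then %-escapes every octet
# that is not an unreserved ASCII character.
encoded = quote(name, encoding="utf-8")
print(encoded)

# Random octets that are not valid UTF-8 can be %-escaped too.
raw = bytes([0xFF, 0xFE, 0x41])
encoded_raw = quote(raw)
print(encoded_raw)

# Decoding reverses both steps.
decoded = unquote(encoded, encoding="utf-8")
decoded_raw = unquote_to_bytes(encoded_raw)
```

The result is pure ASCII in both cases, which is all the URI syntax asks for; whether the escaped octets mean UTF-8 is a convention between sender and receiver.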

-- 
John Cowan  [EMAIL PROTECTED]
http://www.reutershealth.com  http://www.ccil.org/~cowan
Humpty Dump Dublin squeaks through his norse
Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
Humpty Dump Dublin's grandada of all rogues.  --Cousin James

Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan

Peter Kirk scripsit:

> I think the problem here is that a Unix filename is a string of octets, 
> not of characters. And so it should not be converted into another 
> encoding form as if it is characters; it should be processed at a quite 
> different level of interpretation.

Unfortunately, that is simply a counsel of perfection.

Unix filenames are in general input as character strings, output as character
strings, and intended to be perceived as character strings.  The corner
cases in which this does not work are not sufficient to overthrow the
power and generality to be achieved by assuming it 99% of the time.

(A private correspondent has come up with an ingenious trick which
depends on being able to create files named 0x08 and 0x7F, but it
truly is a trick, and in any case depends only on an ASCII interpretation.)

-- 
Income tax, if I may be pardoned for saying so,     John Cowan
is a tax on income.  --Lord Macnaghten (1901)   [EMAIL PROTECTED]

Re: RE: Roundtripping in Unicode

2004-12-13 Thread John Cowan

Doug Ewell scripsit:

> "When faced with [an] ill-formed code unit sequence while transforming
> or interpreting text, a conformant process must treat the first code
> unit... as an illegally terminated code unit sequence -- for example, by
> signaling an error, filtering the code unit out, or representing the
> code unit with a marker such as U+FFFD REPLACEMENT CHARACTER."

Plan 9, the original all-UTF-8 environment (it was translated
in a single day from Latin-1 to UTF-8), represents ill-formed code unit
sequences with the otherwise useless U+0080, on the grounds that an
ill-formed code is semantically different from an untranslatable
character, which is the purpose of U+FFFD.
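
The distinction can be sketched with Python's codec error handlers. The handler name "plan9-style" and the function below are hypothetical, and emitting one U+0080 per ill-formed span is an assumption about granularity, not a description of Plan 9's actual code.

```python
import codecs

def plan9_style(err):
    """Stand in for each ill-formed span with U+0080 rather than U+FFFD."""
    if isinstance(err, UnicodeDecodeError):
        # Return the replacement text and the offset at which to resume.
        return ("\u0080", err.end)
    raise err

# "plan9-style" is a made-up handler name used only in this sketch.
codecs.register_error("plan9-style", plan9_style)

data = b"ab\xffcd"  # 0xFF can never occur in well-formed UTF-8

standard = data.decode("utf-8", errors="replace")      # uses U+FFFD
plan9 = data.decode("utf-8", errors="plan9-style")     # uses U+0080
```

A consumer that later sees U+0080 knows the input was ill-formed, whereas U+FFFD could equally mean that a legacy character had no Unicode equivalent.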

-- 
LEAR: Dost thou call me fool, boy?  John Cowan
FOOL: All thy other titles  http://www.ccil.org/~cowan
 thou hast given away:  [EMAIL PROTECTED]
  That thou wast born with. http://www.reutershealth.com

Re: Nicest UTF

2004-12-13 Thread John Cowan

Lars Kristan scripsit:

> > I'm using ISO-8859-2.
> In fact you're lucky. Many ISO-8859-1 filenames display correctly in
> ISO-8859-2. Not all users are so lucky.

It was a design point of ISO-8859-{1,2,3,4}, but not any other variants,
that every character appears either at the same codepoint or not at all.
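
This design point can be checked mechanically. The sketch below relies on Python's bundled codec tables (iso8859_1 through iso8859_5); the helper names are hypothetical.

```python
from itertools import combinations

def high_half(codec):
    """Map each character in 0xA0-0xFF to its byte; skip unassigned bytes."""
    table = {}
    for byte in range(0xA0, 0x100):
        try:
            table[bytes([byte]).decode(codec)] = byte
        except UnicodeDecodeError:
            pass  # e.g. ISO-8859-3 leaves a few positions undefined
    return table

def conflicts(codec_a, codec_b):
    """Characters present in both sets but at different code points."""
    a, b = high_half(codec_a), high_half(codec_b)
    return sorted(ch for ch in a.keys() & b.keys() if a[ch] != b[ch])

parts = ["iso8859_1", "iso8859_2", "iso8859_3", "iso8859_4"]
for x, y in combinations(parts, 2):
    print(x, y, conflicts(x, y))

# For contrast, a later part that was not designed this way.
print("iso8859_1", "iso8859_5", conflicts("iso8859_1", "iso8859_5"))
```

An empty list for a pair means that a filename written in one of the two sets shows either the right character or a wrong one when viewed in the other, but never a shared character in the wrong place.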

-- 
John Cowan  [EMAIL PROTECTED]
At times of peril or dubitation,  http://www.ccil.org/~cowan
Perform swift circular ambulation,  http://www.reutershealth.com
With loud and high-pitched ululation.

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread John Cowan

Philippe Verdy scripsit:

> Didn't know that. Is this a very recent use?

It's been used as an English verb, adjective, and noun for 30-40 years
and perhaps much longer: see below.

> In France, I think that RSVP was introduced and widely used at end of 
> telegraphic messages (that contained lots of conventional acronyms), it 
> survived at the time of telex, but now it is renewed with SMS messages on 
> cellular phones, but is rarely used in emails.
> 
> May be this was introduced in English at the old time of telegraphs as a 
> useful abbreviation, but with a different meaning when it is used as a 
> verb for saying "reply as requested"?

As far as I know, they were first used in formal invitations (to weddings,
funerals, dances, etc.) in the corner of the card, as both shorter and
more fancy than the older phrase "The favor of your reply is requested".
Later came the "RSVP card", a small card included with the invitation
for the invitee to respond with.  "An RSVP" of course means "a reply to
an invitation marked 'RSVP'."

-- 
My corporate data's a mess! John Cowan
It's all semi-structured, no less.  http://www.ccil.org/~cowan
But I'll be carefree  [EMAIL PROTECTED]
Using XSLT  http://www.reutershealth.com
On an XML DBMS.

Re: Nicest UTF

2004-12-10 Thread John Cowan

Philippe Verdy scripsit:

> And I disagree with you about the fact the U+0000 can't be used in XML 
> documents. It can be used in URI through URI escaping mechanism, as 
> explicitly indicated in the XML specification...

You have a hold of the right stick but at the wrong end.  U+0000 can be
encoded in a URI as %00, but that does not mean that the IRIs in system ids
and namespace names (and potentially other places) can contain explicit
U+0000 characters or &#0; escapes either.  Both of those are illegal,
and documents that contain them are not well-formed.

In character content and attribute values, U+0000 is not possible.
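
A quick way to see this is to feed a few documents to an XML parser. The sketch below uses the expat-based parser in Python's standard library; the element and attribute names are made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: bytes) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

plain = is_well_formed(b"<a>text</a>")
nul_reference = is_well_formed(b"<a>&#0;</a>")       # NUL as a character reference
nul_literal = is_well_formed(b"<a>\x00</a>")         # NUL as a literal octet
uri_escape = is_well_formed(b'<a href="x%00y"/>')    # %00 is three ASCII characters

print(plain, nul_reference, nul_literal, uri_escape)
```

The %00 form survives because, as far as XML is concerned, it is just a percent sign and two digits; any NUL it denotes exists only after URI unescaping.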

> And the fact that the various character productions, that are normally 
> normative, have been changed so often, sometimes through erratas that 
> were forgotten in the text of the next edition of the standard,  

Do you have evidence for this claim?

> The only thing about which I can agree is that XML will forbid surrogates 
> and U+FFFE and U+FFFF, but I won't say that a XML parser that does not 
> reject NULs or other non-characters or "disallowed" C0 controls is so 
> much buggy. 

You are of course entitled to your uninformed opinion.

> But all these is also a proof that XML documents are definitely NOT 
> plain-text documents, so you can't use Unicode encoding rules at the 
> encoded XML document level, only at the finest plain-text nodes (these 
> are the levels that the productions in the XML standard are trying, with 
> more or less success, to standardize).

You can't blindly do *normalization* of XML documents as if they were
plain text.  *Encoding* XML documents according to Unicode is of course
possible and desirable.

> As a consequence any process that blindly applies a plain-text 
> normalization to a complete XML document is bogous, because it breaks the 
> most basic XML conformance, i.e. the core document structure...

In one extraordinarily unlikely case, yes: the appearance of a
combining overlay slash following the ">" that closes a tag will
damage the document if it is NFC-normalized.
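
The unlikely case can be reproduced with Python's unicodedata module. The document below is a made-up example whose content begins with U+0338 COMBINING LONG SOLIDUS OVERLAY right after the start tag.

```python
import unicodedata
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Element content that starts with a combining overlay slash.
original = "<a>\u0338x</a>"

# NFC composes ">" + U+0338 into U+226F NOT GREATER-THAN,
# so the ">" that closed the start tag disappears.
normalized = unicodedata.normalize("NFC", original)

before = is_well_formed(original)
after = is_well_formed(normalized)
print(before, after)
```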

-- 
You are a child of the universe no less John Cowan
than the trees and all other acyclic  http://www.reutershealth.com
graphs; you have a right to be here.  http://www.ccil.org/~cowan
  --DeXiderata by Sean McGrath  [EMAIL PROTECTED]

Re: Nicest UTF

2004-12-10 Thread John Cowan

Philippe Verdy scripsit:

> >Okay, I'm confused. Does ≮ open a tag? Does it matter if it's 
> >composed or decomposed?
> 
> It does not open a XML tag.
> It does matter if it's composed (won't open a tag) or decomposed (will 
> open a tag, but with a combining character, invalid as an identifier 
> start)

Let's be precise here.  If the 7-character character sequence "&#8814;"
appears in an XML document, it never opens a tag and it is never changed
by normalization.  If the 1-character sequence consisting of a single
U+226E appears in an XML document, and that document is put through
NF(K)D, it will become not well-formed.  However, NF(K)D is not
recommended for XML documents, which should be in NFC.
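
Both cases can be checked with a short Python sketch. The element name is made up; &#8814; is the decimal character reference for U+226E NOT LESS-THAN.

```python
import unicodedata
import xml.etree.ElementTree as ET

def is_well_formed(document: str) -> bool:
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

by_reference = "<a>&#8814;</a>"   # seven ASCII characters inside the element
by_character = "<a>\u226e</a>"    # the single character U+226E

# The reference consists of ASCII only, so normalization leaves it alone.
reference_nfd = unicodedata.normalize("NFD", by_reference)

# The literal character decomposes into "<" followed by U+0338.
character_nfd = unicodedata.normalize("NFD", by_character)

parsed_text = ET.fromstring(by_reference).text
print(is_well_formed(reference_nfd), is_well_formed(character_nfd))
```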

-- 
First known example of political correctness:   John Cowan
"After Nurhachi had united all the other    http://www.reutershealth.com
Jurchen tribes under the leadership of the  http://www.ccil.org/~cowan
Manchus, his successor Abahai (1592-1643)   [EMAIL PROTECTED]
issued an order that the name Jurchen should   --S. Robert Ramsey,
be banned, and from then on, they were all The Languages of China
to be called Manchus."

Re: Nicest UTF

2004-12-10 Thread John Cowan

Philippe Verdy scripsit:

> If you look at the XML 1.0 Second Edition

The Second Edition has been superseded by the Third.

> Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | 
> [#x10000-#x10FFFF]

That is normative.

> But the comment following it specifies:

That comment is not normative and not meant to be precise.

> the restrictive 
> definition of "Char" above also includes the whole range of C1 controls 

By oversight.

> (#x80..#x9F), so I can't understand why the Char definition is so 
> restrictive on controls; in addition the definition of Char also 
> *includes* many non-characters (it only excludes surrogates, and U+FFFE 
> and U+FFFF, but forgets to exclude U+1FFFE and U+1FFFF, U+2FFFE and 
> U+2FFFF, ..., U+10FFFE and U+10FFFF).

By oversight again.

> Note however that nearly all XML parsers don't seem to honor this 
> constraint (like SGML parsers...)!

Please specify the parsers that do and don't honor this.  Any which
don't honor it are buggy, and any documents which exploit those bugs
are not XML.

> What is even worse is that XML 1.1 now reallows NUL for system 
> identifiers and URIs, through escaping mechanisms.

Not true.  U+0000 is absolutely excluded in both XML 1.0 and XML 1.1.
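
For reference, the XML 1.0 Char production discussed in this message (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) can be written as a predicate. The function name is hypothetical; the ranges are those of the production.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

for cp in (0x0, 0x1, 0x9, 0x85, 0xD800, 0xFFFE, 0x1FFFE, 0x10FFFF):
    print("U+%04X" % cp, is_xml10_char(cp))
```

The predicate rejects U+0000, the other excluded C0 controls, the surrogates, U+FFFE and U+FFFF, while the C1 controls and the supplementary-plane noncharacters pass, which are the two oversights noted above.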

-- 
"I could dance with you till the cows   John Cowan
come home.  On second thought, I'd  http://www.ccil.org/~cowan
rather dance with the cows when you http://www.reutershealth.com
came home."  --Rufus T. Firefly [EMAIL PROTECTED]

Re: Nicest UTF

2004-12-10 Thread John Cowan

Marcin 'Qrczak' Kowalczyk scripsit:

> http://www.w3.org/TR/2000/REC-xml-20001006#charsets
> implies that the appropriate level for parsing XML is code points.

You are reading the XML Recommendation incorrectly.  It is not defined
in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of
characters.  XML processors are required to process UTF-8 and UTF-16,
and may process other character encodings or not.  But the internal
model is that of characters.  Thus surrogate code points are not
allowed.

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say "Gosh!"
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.
--The Linux-nationale by Greg Baker

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-10 Thread John Cowan

Kenneth Whistler scripsit:

> On the other hand, for many English speakers, "RSVP" is simply
> learned as an unanalyzed verb, pronounced "aressveepee", meaning
> "send a response to this message". And to castigate such speakers
> for politely prepending a "please" to that verb is a little
> too much, don't you think?

It's also pervasive in English:  SALT talks, OPEC countries (or nations),
Mississippi River, Gobi Desert.

-- 
"[T]he Unicode Standard does not encode John Cowan
idiosyncratic, personal, novel, or private  http://www.ccil.org/~cowan
use characters, nor does it encode logos    http://www.reutershealth.com
or graphics."   [EMAIL PROTECTED]

Re: Nicest UTF

2004-12-10 Thread John Cowan

Marcin 'Qrczak' Kowalczyk scripsit:

> > The XML/HTML core syntax is defined with fixed behavior of some
> > individual characters like '&', '<', quotation marks, and with special
> > behavior for spaces.
> 
> The point is: what "characters" mean in this sentence. Code points?
> Combining character sequences? Something else?

Neither.  Unicode characters.

-- 
"May the hair on your toes never fall out!" John Cowan
--Thorin Oakenshield (to Bilbo) [EMAIL PROTECTED]

Re: Nicest UTF

2004-12-08 Thread John Cowan

Marcin 'Qrczak' Kowalczyk scripsit:

> String equality in a programming language should not treat composed
> and decomposed forms as equal. Not this level of abstraction.

Well, that assumes that there's a special "string equality" predicate, as
distinct from just having various predicates that DWIM.  In a Unicode Lisp
implementation, e.g., equal might be char-by-char equality and equalp might not.
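
The same split between a strict predicate and a looser one can be sketched in Python. The function names are hypothetical, and this is an analogy to equal/equalp rather than a description of any Lisp implementation.

```python
import unicodedata

def code_point_equal(a: str, b: str) -> bool:
    """Strict comparison: the same code points in the same order."""
    return a == b

def canonical_equal(a: str, b: str) -> bool:
    """Looser comparison: equal after canonical normalization (NFC here)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

print(code_point_equal(composed, decomposed))
print(canonical_equal(composed, decomposed))
```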

> They are supposed to be equivalent when they are actual characters.
> What if they are numeric character references? Should "<&#824;"
> (7 characters) represent a valid plain-text character or be a broken
> opening tag?

It's a broken opening tag.

> Note that if it's a valid plain-text character, it's impossible
> to represent isolated combining code points in XML, 

It's problematic to represent the *specific* combining code point
when it appears immediately after a tag.

-- 
Don't be so humble.  You're not that great. John Cowan
--Golda Meir    [EMAIL PROTECTED]

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread John Cowan

Kenneth Whistler scripsit:

> A Sybase ASE database has the same behavior running on Windows as
> running on Sun Solaris or Linux, for that matter.

Fair enough.

> UNIX filenames are just one instance of this. 

However, although they are *technically* octet sequences, they
are *functionally* character strings.  That's the issue.

> Failing that, then BINARY fields *are* the appropriate
> way to deal with arbitrary arrays of bytes that cannot
> be interpreted as characters. 

This is purism.  All the filenames on my Unix system, for example, can
be interpreted as character strings; the potential to create filenames
that can't be is unutilized, and sensibly so.  For that matter, the
potential to create files containing C0 controls is also unutilized.

> > in the same way that it would
> > be overkill to encode all 8-bit strings in XML using Base-64
> > just because some of them may contain control characters that are
> > illegal in well-formed XML.
> 
> Dunno about the XML issue here -- you're the expert on what
> the expected level of illegality in usage is there.

XML's policy is zero tolerance, both for illegal encodings and for
illegal characters such as U+0001.  So in order to be *100% sure* that
a character string (ASCII, Latin-1, or UTF-*, it matters not) can be put
into an XML document, one must treat it as binary and encode it as such,
using QP or Base64 or what have you.  But nobody does.
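
What being "100% sure" would involve can be sketched as follows. The element name "s" and the attribute "enc" are invented for the illustration, and the character test follows the XML 1.0 Char production.

```python
import base64
from xml.sax.saxutils import escape

def is_xml10_char(cp: int) -> bool:
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)

def to_element(text: str) -> str:
    """Wrap text in a hypothetical <s> element, falling back to Base64."""
    if all(is_xml10_char(ord(ch)) for ch in text):
        # The common case: only markup characters need escaping.
        return "<s>" + escape(text) + "</s>"
    # The rare case: treat the string as binary and encode it as such.
    payload = base64.b64encode(text.encode("utf-8", errors="surrogatepass"))
    return '<s enc="base64">' + payload.decode("ascii") + "</s>"

ordinary = to_element("a<b")
awkward = to_element("a\x01b")
print(ordinary)
print(awkward)
```

In practice almost every string takes the first branch, which is why hardly anyone writes the second.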

XML 1.1 allows the representation of every Unicode character except
U+0000, which materially reduces the problem, but there is little support
for XML 1.1 as yet.

In any case, this case is only an analogy, not an exact equivalent:
the problems of representing illegal *characters* in an XML document is
closely analogous to the problem of representing illegal *bytes* in a
character string.

> The point I'm making is that *whatever* you do, you are still
> asking for implementers to obey some convention on conversion
> failures for corrupt, uninterpretable character data.
> My assessment is that you'd have no better success at making
> this work universally well with some set of 128 magic bullet
> corruption pills on Plane 14 than you have with the
> existing Quoted-Unprintable as a convention.

It doesn't have to work universally; indeed, it becomes a QOI issue.
Allocating representations of bytes with "bits that are high" makes
it possible to do something recoverable, at very little expense to the
Unicode Consortium.
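
A mechanism of this kind exists in Python 3, which postdates this thread: the "surrogateescape" error handler maps each undecodable byte 0x80-0xFF to a lone surrogate U+DC80-U+DCFF and maps it back on encoding. It uses unpaired surrogates rather than newly allocated characters, so it only illustrates the idea and is not the proposal under discussion. The filename is made up.

```python
# A Latin-1 filename, which is ill-formed when read as UTF-8.
raw = b"caf\xe9.txt"

# Well-formed parts decode normally; the stray byte becomes U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original octets exactly.
restored = text.encode("utf-8", errors="surrogateescape")

print(ascii(text), restored == raw)
```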

> Further, as it turns out that Lars is actually asking for
> "standardizing" corrupt UTF-8, a notion that isn't going to
> fly even two feet, I think the whole idea is going to be
> a complete non-starter.

I agree that that part won't fly, absolutely.

-- 
In politics, obedience and support  John Cowan <[EMAIL PROTECTED]>
are the same thing.  --Hannah Arendt  http://www.ccil.org/~cowan

Re: OpenType not for Open Communication?

2004-12-08 Thread John Cowan

Michael Everson scripsit:

> >I think it's more accurate to say that you need to find a way to
> >compensate the font developer for his effort; this need not involve 
> >money.
> >I, for example, create programs and give them to people for a reward I
> >consider sufficient; professionally, I write bespoke software which is
> >not useful to anyone but my employer; some 90% of all software written
> >is in this class.
> 
> Read: "my employer pays me to make software".

Yes, of course; I thought that was implicit.  My point is that not all
the software made is paid for, not by a long chalk, and most of what is
made is not sold to anyone.  Neither of these things is true of fonts
in any but the most trivial ways -- yet.

-- 
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,  http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, LOTR:FOTR

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread John Cowan

Kenneth Whistler scripsit:

> Storage of UNIX filenames on Windows databases, for example,
> can be done with BINARY fields, which correctly capture the
> identity of them as what they are: an unconvertible array of
> byte values, not a convertible string in some particular
> code page.

This solution, however, is overkill, in the same way that it would
be overkill to encode all 8-bit strings in XML using Base-64
just because some of them may contain control characters that are
illegal in well-formed XML.

> In my opinion, trying to do that with a set of encoded characters
> (these 128 or something else) is *less* likely to solve the
> problem than using some visible markup convention instead.

The trouble with the visible markup, or even the PUA, is that
"well-formed filenames", those which are interpretable as
UTF-8 text, must also be encoded so as to be sure any
markup or PUA that naturally appears in the filename is
escaped properly.  This is essentially the Quoted-Printable
encoding, which is quite rightly known to those stuck with
it as "Quoted-Unprintable".
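
The cost of visible markup shows up in even a small sketch: once "=" marks an escaped byte, every literal "=" in a perfectly well-formed name has to be escaped too. The function names and the choice of "=" as marker are assumptions made for the illustration.

```python
def escape_name(raw: bytes) -> str:
    """Render a filename as text, writing undecodable bytes as =XX."""
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("=%02X" % (cp - 0xDC00))  # an ill-formed byte
        elif ch == "=":
            out.append("=3D")                    # the marker itself
        else:
            out.append(ch)
    return "".join(out)

def unescape_name(text: str) -> bytes:
    """Invert escape_name()."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "=":
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)

sample = b"a=b\xff.txt"
escaped = escape_name(sample)
print(escaped)
```

A name such as "a=b.txt" is valid UTF-8 and still comes out altered, which is the objection made above.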

> Simply
> encoding 128 characters in the Unicode Standard ostensibly to
> serve this purpose is no guarantee whatsoever that anyone would
> actually implement and support them in the universal way you
> envision, any more than they might a "=93", "=94" convention.

Why not, when it's so easy to do so?  And they'd be *there*,
reserved, unassignable for actual character encoding.

Plane E would be a plausible location.

-- 
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,  http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, LOTR:FOTR

Re: Word dividers, was: proposals I wrote (and also, didn't write)

2004-12-07 Thread John Cowan

Peter Kirk scripsit:

> I notice that Elaine is here proposing a HEBREW SAMARITAN PUNCTUATION 
> WORD DIVIDER - and this should be in the BMP as Samaritan is a script in 
> modern list. But there is already in the pipeline a PHOENICIAN WORD 
> SEPARATOR, provisionally U+1091F, and already defined U+10101 AEGEAN 
> WORD SEPARATOR DOT, and also of course U+00B7 MIDDLE DOT. The glyphs for 
> all of these seem indistinguishable, and so are the functions. The only 
> difference seems to be the scripts they are associated with, but 
> punctuation marks are supposed to be not tied to individual scripts.

Well, some are and some aren't.  Arabic ? is definitely tied to Arabic,
for example.  As usual, Unicode is empirical rather than rational.

In any case, MIDDLE DOT, despite its official classification as
punctuation, requires special treatment because of its use in
Catalan orthography as effectively a modifier letter, so it is
not useful to unify it with anything else.  (It is already
canonically equivalent to GREEK ANO TELEIA, which is regrettable.)

> Is there really a need for so many almost identical word divider dots? 

Probably not.  We already have gobs of dots.  It's one of those things:
on the other hand, Unicode unifies all the Indic dandas, for example.

-- 
But you, Wormtongue, you have done what you could for your true master.  Some
reward you have earned at least.  Yet Saruman is apt to overlook his bargains.
I should advise you to go quickly and remind him, lest he forget your faithful
service.  --Gandalf John Cowan <[EMAIL PROTECTED]>

Re: OpenType not for Open Communication?

2004-12-07 Thread John Cowan

John Hudson scripsit:

> OpenType is a trademark of Microsoft and a proprietary font format 
> jointly developed by Microsoft and Adobe. 

The question is, is it an open standard?  That is, is anyone free to
create OpenType fonts, OpenType font tools, OpenType font renderers?
Is the documentation freely available at no more than nominal cost?

> Unicode is a text encoding standard. Fonts and other software implement 
> the standard. The 'openness' of the standard doesn't imply anything about 
> the 'openness' of the software.

Indeed.

> Font developers are under no obligation to provide you with free fonts. 
> Do you not charge for your work? If you want fonts to be freely 
> available, you have to find some way to pay for their development,  

I think it's more accurate to say that you need to find a way to
compensate the font developer for his effort; this need not involve money.
I, for example, create programs and give them to people for a reward I
consider sufficient; professionally, I write bespoke software which is
not useful to anyone but my employer; some 90% of all software written
is in this class.

(Are there bespoke fonts which the buyer keeps to himself?)

-- 
Using RELAX NG compact syntax to    John Cowan
develop schemas is one of the simple  http://www.reutershealth.com
pleasures in life   http://www.ccil.org/~cowan
--Jeni Tennison <[EMAIL PROTECTED]>

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread John Cowan

Doug Ewell scripsit:

> > Now suppose you have a UNIX filesystem, containing filenames in a
> > legacy encoding (possibly even more than one). If one wants to switch
> > to UTF-8 filenames, what is one supposed to do? Convert all filenames
> > to UTF-8?
> 
> Well, yes.  Doesn't the file system dictate what encoding it uses for
> file names?  How would it interpret file names with "unknown" characters
> from a legacy encoding?  How would they be handled in a directory
> search?

Windows filesystems do know what encoding they use.  But a filename on
a Unix(oid) file system is a mere sequence of octets, of which only 00
and 2F are interpreted.  (Filenames containing 20, and especially 0A,
are annoying to handle with standard tools, but not illegal.)

How these octet sequences are translated to characters, if at all,
is no concern of the file system's.  Some higher-level tools, such as
directory listers and shells, have hardwired assumptions, others have
changeable assumptions, but all are assumptions.
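
The file system's side of the contract is small enough to write down. This sketch covers only the two octets mentioned above; it ignores length limits and the special names "." and "..", and the function name is hypothetical.

```python
def is_legal_component(name: bytes) -> bool:
    """True if a Unix(oid) file system would accept this name component."""
    # Only 0x00 (terminator) and 0x2F (path separator) are interpreted.
    return len(name) > 0 and b"\x00" not in name and b"/" not in name

candidates = [
    b"report.txt",
    b"caf\xe9",        # Latin-1 octets, ill-formed as UTF-8
    b"\xff\xfe",       # not text in any common encoding
    b"two words\n",    # 0x20 and 0x0A: annoying but legal
    b"a/b",            # contains the separator
    b"a\x00b",         # contains the terminator
]
for name in candidates:
    print(name, is_legal_component(name))
```

Nothing in the check mentions an encoding; any translation to characters happens in the tools layered on top.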

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
No man is an island, entire of itself; every man is a piece of the
continent, a part of the main.  If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friends or of thine own were: any man's death diminishes me,
because I am involved in mankind, and therefore never send to know for
whom the bell tolls; it tolls for thee.  --John Donne

Re: current version of unicode-font

2004-12-02 Thread John Cowan

Paul Hastings scripsit:

> speaking of which, *are* there any open source fonts that come even 
> close to Arial Unicode MS?

In what, breadth of coverage or aesthetics?  The GNU Unifont has very
wide coverage though it is a bitmap font; James Kass's CODE 2000 and CODE
2001 probably have the widest coverage of any font, though it costs US$5
to use them.  Both of them IMHO are a tad on the ugly side.

Googling for "free Unicode fonts" (no quotes) is useful.

-- 
One Word to write them all,     John Cowan <[EMAIL PROTECTED]>
  One Access to find them,  http://www.reutershealth.com
One Excel to count them all,  http://www.ccil.org/~cowan
  And thus to Windows bind them.--Mike Champion

Re: Relationship between Unicode and 10646

2004-11-30 Thread John Cowan

Peter Kirk scripsit:

> >There are a number of people, yourself included, who are actively, 
> >either maliciously or from ignorance, misrepresenting the relationship 
> >between the UTC and WG2, and of the standardization process, under the 
> >guise of "innocent" discussion. ...
> 
> I have merely been asking searching questions, partly from ignorance I 
> agree. If you or anyone else considers that I have been misrepresenting 
> the relationship, you are free to correct me.

Your main misunderstanding seems to be your belief that WG2 is a
democratic body; that is, that it makes decisions by majority vote.
Decisions are made by explicitly reached consensus, and ballots are an
instrument of reaching consensus.  Every "no" vote must be accompanied
by comments such that, if they were accepted, the "no" vote would be
changed to "yes".  ("Yes" votes can have comments too.)  The result of a
"no" vote is that the process loops until all such votes are resolved.
Although the UTC does not have a vote as such, being a liaison member,
its input is treated as if it were vote comments.

If consensus cannot be reached, the proposal is eventually dropped, I suppose.

-- 
Time alone is real  John Cowan <[EMAIL PROTECTED]>
  the rest imaginary  http://www.reutershealth.com
like a quaternion   --phma  http://www.ccil.org/~cowan

Re: (base as a combing char)

2004-11-27 Thread John Cowan

Philippe Verdy scripsit:

> For this reason, Dutch will need a distinct "ij" 
> letter, coded as a single character, and with its own capitalization rules 
> (the uppercase or titlecase form of "ij" will be the single letter "IJ", 
> not two letters and not "Ij"; also there exists cases where diacritics can 
> be added on top of the "ij" letter, which is then more tied as a single 
> letter than a simple digraph.)

Everything you say is correct *except* for the need to encode Dutch
ij as a single character, which is neither necessary nor practical.
(U+0132 and U+0133 are encoded for compatibility only.)  In cases where
ij is a digraph in Dutch text, i+ZWNJ+j will be effective.

-- 
"Kill Gorgûn!  Kill orc-folk!   John Cowan
No other words please Wild Men. [EMAIL PROTECTED]
Drive away bad air and darkness http://www.reutershealth.com
with bright iron!"  --Ghân-buri-Ghân  http://www.ccil.org/~cowan

Re: Relationship between Unicode and 10646

2004-11-26 Thread John Cowan

Peter Kirk scripsit:

> I don't want to go along with Philippe entirely on this, but surely he 
> must be right on this last point. Formally, Unicode is effectively the 
> agent of just one national body in this decision-making process.  

The Unicode Consortium is not an agent of the USNB, although it is
a U.S. corporation.  It is itself an international organization, even
having some governmental bodies as members (agencies of the Indian and
Pakistani national governments and the Tamil Nadu state government),
one intergovernmental organization, one international non-governmental
organization, and at least a dozen non-U.S. corporations.

> But formally these other bodies do have the right to 
> outvote Unicode, and in effect to force Unicode to reverse its decisions 
> - or else to reverse its policy of maintaining compatibility.

Formally, yes.  However, by acts of self-abnegation, WG2 has a fixed
policy of not overriding the UTC or vice versa.

> Here in Europe it does not go down well when US bodies claim the right 
> to make decisions for the whole world, 

It's a mistake to think of the Consortium as a U.S. body.

-- 
Mark Twain on Cecil Rhodes: John Cowan
"I admire him, I freely admit it,   http://www.ccil.org/~cowan
 and when his time comes I shall    http://www.reutershealth.com
 buy a piece of the rope for a keepsake."   [EMAIL PROTECTED]

Re: Misuse of 8th bit [Was: My Querry]

2004-11-26 Thread John Cowan

Antoine Leca scripsit:

> In a similar vein, I cannot be in agreement that it could be advisable to
> use the 22th, 23th, 32th, 63th, etc., the upper bits of the storage of a
> Unicode codepoint. Right now, nobody is seeing any use for them as part of
> characters, but history should have learned us we should prevent this kind
> of optimisations to occur.

No, I don't agree with this part.  Unicode just isn't going to expand
past 0x10FFFF unless Earth joins the Galactic Empire.  So the upper bits
are indeed free for private uses.

> Particularly when it is NOT defined by the
> standards: such a situation leads everybody and his dog to find his
> particular "optimum" use for these "free space", and these classes of
> optimums do not generally collides between them...

I don't think this matters as long as the upper bits are not used in
interchange.  For example, it would be reasonable to represent Unicode
characters as immediates on a virtual machine by using some pattern in
the upper bits that flags them as characters.
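
A sketch of such a tagging scheme, assuming 32-bit machine words. The tag value 0x7E and the 8/24 bit split are arbitrary choices made for the illustration; a code point needs only the low 21 bits.

```python
TAG_SHIFT = 24               # hypothetical layout: 8 tag bits, 24 payload bits
CHAR_TAG = 0x7E              # hypothetical tag meaning "this word is a character"
PAYLOAD_MASK = (1 << TAG_SHIFT) - 1
MAX_CODE_POINT = 0x10FFFF

def box_char(cp: int) -> int:
    """Pack a code point into an immediate word with the character tag."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    return (CHAR_TAG << TAG_SHIFT) | cp

def is_char(word: int) -> bool:
    """True if the word carries the character tag in its upper bits."""
    return (word >> TAG_SHIFT) == CHAR_TAG

def unbox_char(word: int) -> int:
    """Recover the code point; the tag never leaves the virtual machine."""
    return word & PAYLOAD_MASK

word = box_char(0x10FFFF)
print(hex(word), is_char(word), hex(unbox_char(word)))
```

The tag is stripped before anything is written out, so nothing outside the machine ever sees the upper bits.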

-- 
Eric Raymond is the Margaret Mead   John Cowan
of the Open Source movement.    [EMAIL PROTECTED]
--Bruce Perens, http://www.ccil.org/~cowan
  some years ago  http://www.reutershealth.com

Re: My Querry

2004-11-23 Thread John Cowan

Antoine Leca scripsit:

> Sorry, no: there is no requirement to clear it.
> You are assuming something about the way data are handled. When you handle
> ASCII data using octets, you can perfectly, and conformantly, keep some
> other "data" (being parity or whatever) inside the 8th bit; so with even
> parity AT SIGN will be managed as 192, without any kind of problem (for
> you). 

Indeed, the DEC PDP-8 stored ASCII data with the high bit always set, for
compatibility with the way in which ASR-33 Teletypes generated ASCII.
So I grew up thinking of 'A' as 301 (octal).  I believe that PR1ME
computers did this too.
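
The arithmetic behind both figures is easy to check. The function names below are hypothetical; "mark" here means the eighth bit is always set.

```python
def with_mark_bit(code: int) -> int:
    """7-bit ASCII code with the eighth bit always set."""
    return (code & 0x7F) | 0x80

def with_even_parity(code: int) -> int:
    """7-bit ASCII code with the eighth bit chosen so the count of 1 bits is even."""
    low = code & 0x7F
    ones = bin(low).count("1")
    return low | (0x80 if ones % 2 else 0)

a_marked = with_mark_bit(ord("A"))
at_even = with_even_parity(ord("@"))

print(oct(a_marked))   # 'A' with the high bit set
print(at_even)         # AT SIGN with even parity
```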

-- 
But you, Wormtongue, you have done what you could for your true master.  Some
reward you have earned at least.  Yet Saruman is apt to overlook his bargains.
I should advise you to go quickly and remind him, lest he forget your faithful
service.  --Gandalf John Cowan <[EMAIL PROTECTED]>

Re: Unicode HTML, download

2004-11-21 Thread John Cowan

Peter Kirk scripsit:

> Please read my earlier posting. Of course it does make things rather 
> difficult that none of my postings ever get approved on a Sunday, 
> especially when I am trying to correct seriously misleading factual errors.

Yr hble Hebrew Moderator attempts to work 24/7, but occasionally the need
to sleep or to engage in business (I was at a conference all last week)
or family business (a death in a friend's family) interferes with this
otherwise laudable goal.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
"In computer science, we stand on each other's feet."
--Brian K. Reid

Re: [even more increasingly OT-- into Sunday morning] Re: Unicode HTML, download

2004-11-21 Thread John Cowan

Michael (michka) Kaplan scripsit:

> > I haven't used M$ IE for many years, though, and my
> > memory might be wrong.
> 
> Blinded by the misspelling of the product name, maybe? :-)

No, that's just a glyph difference.  :-)

> See http://msdn.microsoft.com/msdnmag/issues/0700/localize/ and the section
> entitled "Choosing Character Sets" for info on what is going on here,
> particularly firgures 3 and 4 for info on how to script the behavior for the
> UTF-8 case

Nice article, though it's obnoxious that the figures will only open
in a pop-up window.  

-- 
Ambassador Trentino: I've said enough. I'm a man of few words.
Rufus T. Firefly: I'm a man of one word: scram!
--Duck Soup John Cowan <[EMAIL PROTECTED]>

Re: U+0000 in C strings

2004-11-15 Thread John Cowan

Philippe Verdy scripsit:

> The "modified UTF-8" encoding is only for use in the serialization of 
> compiled classes that contain a constant string pool, and through the JNI 
> interface to C-written modules using the legacy *UTF() APIs that want to 
> work with C strings.

Plus the original point of contention: binary serialization of Strings
through the DataInput and DataOutput interfaces.

-- 
John Cowan  www.ccil.org/~cowan  www.reutershealth.com  [EMAIL PROTECTED]
In might the Feanorians / that swore the unforgotten oath
brought war into Arvernien / with burning and with broken troth.
and Elwing from her fastness dim / then cast her in the waters wide,
but like a mew was swiftly borne, / uplifted o'er the roaring tide.
--the Earendillinwe

Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)

2004-11-15 Thread John Cowan

Doug Ewell scripsit:

> Then why do the DataInput and DataOutput interfaces perform this special
> conversion?  There isn't any mention, on the page whose URL Theodore
> originally provided, of compatibility with C strings.

Probably because Sun was reusing the format that string literals take in
compiled Java classes.  The format is as compact as UTF-8 provided your
characters are in the range U+0001 to U+FFFF, which is true most of the time.
Serializing with a 32-bit length would be much bulkier.
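
To make the size argument concrete, here is a minimal Python sketch (not
Sun's code) that counts the bytes a string needs in the modified UTF-8 form
documented for DataOutput.writeUTF and compares that with standard UTF-8.
The function names are invented for illustration, and the length prefix is
left out of both counts.

```python
def modified_utf8_len(s: str) -> int:
    """Bytes needed for s in Java's modified UTF-8, without the length prefix."""
    total = 0
    for ch in s:
        cp = ord(ch)
        if cp == 0:
            total += 2   # U+0000 is written as the two bytes C0 80
        elif cp < 0x80:
            total += 1
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:
            total += 3
        else:
            total += 6   # written as a surrogate pair, three bytes per surrogate
    return total


def standard_utf8_len(s: str) -> int:
    """Bytes needed for s in standard UTF-8."""
    return len(s.encode("utf-8"))
```

For text in the range U+0001 to U+FFFF the two counts agree; they differ
only for U+0000 (2 bytes against 1) and for supplementary characters
(6 bytes against 4).  The price of the 2-byte length prefix is that a single
writeUTF call cannot carry more than 65535 encoded bytes.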

> If a Java String consists of a count followed by the data, 

I didn't say that.  A Java String in memory contains a count and the data,
because it is basically a wrapper around a Java array of characters, and Java
arrays contain a count.  (Strings, unlike arrays, are immutable in Java.)
That does not mean that the count is "followed by" the data in the memory
representation, which indeed is up to the JVM -- Java does not prescribe it.

> Those are design benefits.  I was asking about the ability to represent
> text adequately.

Strings are not used solely to represent text; they are more general.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Consider the matter of Analytic Philosophy.  Dennett and Bennett are well-known.
Dennett rarely or never cites Bennett, so Bennett rarely or never cites Dennett.
There is also one Dummett.  By their works shall ye know them.  However, just as
no trinities have fourth persons (Zeppo Marx notwithstanding), Bummett is hardly
known by his works.  Indeed, Bummett does not exist.  It is part of the function
of this and other e-mail messages, therefore, to do what they can to create him.

Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)

2004-11-14 Thread John Cowan

Doug Ewell scripsit:

> As soon as you can think of one, let me know.  I can think of plenty of
> *binary* protocols that require zero bytes, but no *text* protocols.

Most languages other than C define a string as a sequence of characters
rather than a sequence of non-null characters.  The repertoire of characters
that can exist in strings usually has a lower bound, but its full magnitude
is implementation-specific.  In Java, exceptionally, the repertoire is
defined by the standard rather than the implementation, and it includes
U+0000.  In any case, I can think of no language other than C which does
not support strings containing U+0000 in most implementations.

-- 
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com
"But no living man am I!  You look upon a woman.  Eowyn I am, Eomund's daughter.
You stand between me and my lord and kin.  Begone, if you be not deathless.
For living or dark undead, I will smite you if you touch him."

Opinions on this Java URL?

2004-11-13 Thread John Cowan

Theodore H. Smith scripsit:

> I'm just curious about the \0 thing. What problems would having a \0 in 
> UTF-8 present, that are not presented by having \0 in ASCII? I can't 
> see any advantage there.

AFAICT it was a hack so that arbitrary Java strings could be encoded
as C strings; that is, with no 0x00 bytes in them, even when the
string contained a U+0000.  This is the format used in Java class
files for string constants as well.

The important thing is to note that the readUTF and writeUTF methods are
*binary* I/O; they are the standard way of serializing strings,
just as the standard way of serializing ints is to write them out
as a 4-byte big-endian sequence.

They simply have nothing to do with character encoding at all.
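
As a rough illustration of the binary format, the following Python sketch
mimics what writeUTF and writeInt are documented to produce.  It is an
approximation written for this example, not the JDK implementation, and
write_utf and write_int are made-up names.

```python
import struct


def write_utf(s: str) -> bytes:
    """Mimic DataOutput.writeUTF: 2-byte big-endian length, then modified UTF-8."""
    payload = bytearray()
    # Java strings are sequences of UTF-16 code units, so walk those.
    utf16 = s.encode("utf-16-be", "surrogatepass")
    for (unit,) in struct.iter_unpack(">H", utf16):
        if 0x0001 <= unit <= 0x007F:
            payload.append(unit)
        elif unit <= 0x07FF:
            # This branch also handles U+0000, which becomes C0 80.
            payload += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            payload += bytes([
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ])
    if len(payload) > 0xFFFF:
        raise ValueError("encoded form exceeds 65535 bytes")
    return struct.pack(">H", len(payload)) + bytes(payload)


def write_int(n: int) -> bytes:
    """Mimic DataOutput.writeInt: four bytes, big-endian."""
    return struct.pack(">i", n)
```

Note that only the payload is free of 0x00 bytes; the length prefix can
contain them, which is one more sign that the whole thing is a binary record
rather than a text encoding.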

-- 
He made the Legislature meet at one-horse   John Cowan
tank-towns out in the alfalfa belt, so that [EMAIL PROTECTED]
hardly nobody could get there and most of   http://www.reutershealth.com
the leaders would stay home and let him go  http://www.ccil.org/~cowan
to work and do things as he pleased.--Mencken, Declaration of Independence

Re: not font designers?

2004-11-02 Thread John Cowan

Elaine Keown scripsit:

> >Just of curiosity, how many of you are NOT font
> >designers?  
> >
> >And are any of your corpus linguists, text database
> >people, or maybe database designers?  

FWIW, I am none of those things (I've designed a database now and then,
but I'm hardly a "database designer").

-- 
The Imperials are decadent, 300 pound   John Cowan <[EMAIL PROTECTED]>
free-range chickens (except they have   http://www.reutershealth.com
teeth, arms instead of wings, and   http://www.ccil.org/~cowan
dinosaurlike tails).--Elyse Grasso

Re: basic-hebrew RtL-space ?

2004-11-02 Thread John Cowan

Doug Ewell scripsit:

> I've never understood why writing Hebrew or Arabic left-to-right is
> called "visual" order anyway.  These are RTL scripts; they are supposed
> to be not only written, but also read, right-to-left.  Wouldn't a reader
> of Hebrew or Arabic consider RTL to BE the "visual" order?

Of course.  It's sheer ethnocentrism.

-- 
Values of beeta will give rise to dom!  John Cowan
(5th/6th edition 'mv' said this if you tried    http://www.ccil.org/~cowan
to rename '.' or '..' entries; see  [EMAIL PROTECTED]
http://cm.bell-labs.com/cm/cs/who/dmr/odd.html)

Re: Public Review Issue: UAX #24 Proposed Update

2004-09-09 Thread John Cowan

Peter Kirk scripsit:

> >Names are sometimes inaccurate, viz. ZINOR and ZARQA and the infamous 
> >FHTORA.  That doesn't change the meaning or utility of the character.
> 
> Agreed. It simply changes, indeed destroys completely, the utility of 
> the character name.

Not at all.  As I've told you before (and you agreed before), it's
just as much a fallacy to suppose that Unicode character names carry
no information as to suppose that they carry complete information.
The truth is somewhere between:  most names are helpful, a few names
are partially misleading (but not totally so).

As for FHTORA, it's annoying, but I don't see how it can be read as
anything but FTHORA if you know anything about Greek at all, which is
probably why it was overlooked until it was too late.

-- 
You escaped them by the will-death  John Cowan
and the Way of the Black Wheel. [EMAIL PROTECTED]
I could not.  --Great-Souled Sam    http://www.ccil.org/~cowan

Re: Public Review Issue: UAX #24 Proposed Update

2004-09-09 Thread John Cowan

Andrew C. West scripsit:

> "In principle when a character of a given script is used in more than one
> language, no language name is specified. Exceptions are tolerated where an
> ambiguity would otherwise result." [N2652R Annex L Rule 9]

Indeed, but this begs the question of whether the characters in question
are indeed unique to Yiddish or not.  My other point stands.

-- 
Babies are born as a result of the  John Cowan
mating between men and women, and most  http://www.reutershealth.com
men and women enjoy mating. http://www.ccil.org/~cowan
--Isaac Asimov in Earth: Our Crowded Spaceship  [EMAIL PROTECTED]

Re: Public Review Issue: UAX #24 Proposed Update

2004-09-09 Thread John Cowan

Jony Rosenne scripsit:

> The UTC refused to add Yiddish to the name, unlike the other Yiddish
> specialties, and I am not aware of any other possibility.

Why should it?  Incorporating a language name into a character name,
as in ABKHASIAN CHE and KHAKASSIAN CHE, is done because those languages
have a letter named CHE distinct from the more usual, cross-linguistic
Cyrillic CHE.  There is no such contrast in this case: we do not speak of
LATIN SMALL LETTER ICELANDIC THORN, for example.

-- 
Some people open all the Windows;   John Cowan
wise wives welcome the spring   [EMAIL PROTECTED]
by moving the Unix. http://www.reutershealth.com
  --ad for Unix Book Units (U.K.)   http://www.ccil.org/~cowan
(see http://cm.bell-labs.com/cm/cs/who/dmr/unix3image.gif)

Re: Public Review Issue: UAX #24 Proposed Update

2004-09-08 Thread John Cowan

Jony Rosenne scripsit:

> FB1D, HEBREW LETTER YOD WITH HIRIQ, should be assigned to the unknown group.
> It is not a Hebrew character, notwithstanding the misleading name.

To anticipate Michael:  Of course it is.  It's not used in the Hebrew
language, perhaps; but the Hebrew script is used for other languages
besides Hebrew.

-- 
John Cowan   [EMAIL PROTECTED]   http://www.reutershealth.com
"Mr. Lane, if you ever wish anything that I can do, all you will have
to do will be to send me a telegram asking and it will be done."
"Mr. Hearst, if you ever get a telegram from me asking you to do
anything, you can put the telegram down as a forgery."

Japanese pitch accent representations

2004-09-05 Thread John Cowan

The following links show L-shaped marks, apparently combining
characters, that indicate the change-of-pitch position in Japanese
words written in romaji.  Are these novel characters, or can they
be identified with existing Unicode characters?  Are they really
combining?

http://member.newsguy.com/~sakusha/dict/martin-je.html

http://member.newsguy.com/~sakusha/dict/kenkyusha-je.html

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Consider the matter of Analytic Philosophy.  Dennett and Bennett are well-known.
Dennett rarely or never cites Bennett, so Bennett rarely or never cites Dennett.
There is also one Dummett.  By their works shall ye know them.  However, just as
no trinities have fourth persons (Zeppo Marx notwithstanding), Bummett is hardly
known by his works.  Indeed, Bummett does not exist.  It is part of the function
of this and other e-mail messages, therefore, to do what they can to create him.

Re: MSDN Article, Second Draft

2004-08-20 Thread John Cowan

Jungshik Shin scripsit:

> As is often the case, Unicode experts are not necessarily experts on 
> 'legacy' character sets and encodings. The 'official' name of 'ASCII' is 
> ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode, 
> I'm afraid you're spreading misinformation about what came before it.
> The sentence that 'ANSI pushed this scope ... represents 256 characters' 
> is misleading. ANSI has nothing to do with various single, double, 
> triple byte character sets that make up single and multibyte character 
> encodings. They're devised and published by national and international 
> standard organizations as well as various vendors. Perhaps, you'd better 
> just get rid of the sentence 'ANSI pushed ... providing backward 
> compatibility with ASCII'.

Like it or not, "ANSI" has two meanings now: the American National
Standards Institute and a generic term for an 8-bit Windows codepage.
Similarly, "OEM" means both an original equipment manufacturer and an
8-bit PC-DOS codepage.

-- 
"No, John.  I want formats that are actually   John Cowan
useful, rather than over-featured megaliths that   http://www.ccil.org/~cowan
address all questions by piling on ridiculous  http://www.reutershealth.com
internal links in forms which are hideously    [EMAIL PROTECTED]
over-complex." --Simon St. Laurent on xml-dev

Re: Combining across markup?

2004-08-12 Thread John Cowan

Anto'nio Martins-Tuva'lkin scripsit:

> Even better yet: Have the W3C rephrase their demand that no element
> should start with a defective sequence (when considered in separate)
> as that no *block-level* element should etc., and leave things like
> ,  and other in-line elements free to start with a combining
> character (provided that the said in-line container is not the first
> within a block-level element, of course).

The trouble with that idea is that in XML generally we don't know
what is a block-level element: elements are just elements, and it's
up to rendering routines whether they appear as block, inline, or
not at all.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Promises become binding when there is a meeting of the minds and consideration
is exchanged. So it was at King's Bench in common law England; so it was
under the common law in the American colonies; so it was through more than
two centuries of jurisprudence in this country; and so it is today. 
   --Specht v. Netscape

Re: Much better Latin-1 keyboard for Windows

2004-07-23 Thread John Cowan

Michael Everson scripsit:

> >Interesting.  There seems to be no explanation of the seven keyboard
> >states shown in the graphic at ga-keys-x.gif.  Can you explicate them?
> 
> Hm? The shift, alt, and caps lock keys are shown depressed in the drawings.

Ah, that strange glyph is Alt, or rather AltGr, then.  I presume the
Swedish-church-symbol is functioning as the variety of Alt that makes
keyboard accelerators.

-- 
What is the sound of Perl?  Is it not the   John Cowan
sound of a [Ww]all that people have stopped [EMAIL PROTECTED]
banging their head against?  --Larry        http://www.ccil.org/~cowan

Re: Much better Latin-1 keyboard for Windows

2004-07-22 Thread John Cowan

Michael Everson scripsit:

> Please see the specification of the Irish 
> Extended keyboard for Unicode, at 
> http://www.evertype.com/celtscript/ga-keys-x.html

Interesting.  There seems to be no explanation of the seven keyboard
states shown in the graphic at ga-keys-x.gif.  Can you explicate them?

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan
Female celebrity stalker, on a hot morning in Cairo:
"Imagine, Colonel Lawrence, ninety-two already!"
El Auruns's reply:  "Many happy returns of the day!"

Re: http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt

2004-07-19 Thread John Cowan

Asmus Freytag scripsit:

> >Is John Cowan's list supposed to be a complete list of
> >foldables for extant Hebrew code points?
> 
> We know it's not.

It lists all the characters which have points embedded in them.
If you map all those characters away and delete all explicit points and
accents, you have unpointed Hebrew.
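
A minimal Python sketch of that recipe follows.  It assumes that canonical
decomposition (NFD) is an acceptable stand-in for the explicit mapping list,
which holds for presentation forms such as U+FB1D that decompose canonically;
strip_points is a made-up name.

```python
import unicodedata


def strip_points(text: str) -> str:
    """Decompose, then drop every nonspacing mark (points and accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

Punctuation such as maqaf and sof pasuq is not general category Mn, so it
survives this filter; whether it should is a separate decision.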

-- 
Verbogeny is one of the pleasurettes    John Cowan <[EMAIL PROTECTED]>
of a creatific thinkerizer.             http://www.reutershealth.com
   -- Peter da Silva                    http://www.ccil.org/~cowan

Re: Folding algorithm and canonical equivalence

2004-07-18 Thread John Cowan

Asmus Freytag scripsit:

> There are two options for a starting set:
> select all 'accents' (note, not baseforms) that occur in some 
> precomposed character. And then add additional ones on a case by case 
> basis (e.g. stroke overlay).
> 
> Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter will 
> be part of 4.1), and make some principled additions / deletions.

I'd say, start from the combining characters which are in ISO 10646
Level 3, and then add the combining characters from the abjads
(Hebrew, Arabic, Syriac).

-- 
XQuery Blueberry DOM    John Cowan
Entity parser dot-com   [EMAIL PROTECTED]
Abstract schemata   http://www.reutershealth.com
XPointer errata http://www.ccil.org/~cowan
Infoset Unicode BOM --Richard Tobin

Re: Folding algorithm and canonical equivalence

2004-07-18 Thread John Cowan

Peter Kirk scripsit:

> Anyway, is Yiddish in fact never written completely unpointed? That 
> would surprise me.

It might have happened at some point, but the standard (YIVO) Yiddish
orthography would become illegible if points were stripped.

-- 
Principles.  You can't say A is     John Cowan <[EMAIL PROTECTED]>
made of B or vice versa.  All mass  http://www.reutershealth.com
is interaction.  --Richard Feynman  http://www.ccil.org/~cowan

Re: Much better Latin-1 keyboard for Windows

2004-07-18 Thread John Cowan

Raymond Mercier scripsit:

> Jowh Cowan writes

Jowh?

> Latin-1 is not everything! If you need to transcribe
> Arabic/Hebrew/Sanskrit/Farsi, you will need the macrons on vowels (Latin
> Extended-A) and various dot-under letters (Latin Extended Additional). I
> made my own layout using the DDK.

No, it isn't everything, but it's a great deal, especially considering
the annoying behavior of the standard US-International keyboard.

Why not release your keyboard to the world?

-- 
"You're a brave man! Go and break through the   John Cowan
lines, and remember while you're out there  [EMAIL PROTECTED]
risking life and limb through shot and shell,   www.ccil.org/~cowan
we'll be in here thinking what a sucker you are!"   www.reutershealth.com
--Rufus T. Firefly

Much better Latin-1 keyboard for Windows

2004-07-17 Thread John Cowan

http://www.livejournal.com/users/gwalla/39856.html is a page about
(and a link to) a truly excellent Windows keyboard driver that
provides full access to the Latin-1 range but is completely compatible
with the US-ASCII keyboard except for AltGr (the right Alt key).
All non-ASCII characters and dead keys are available there: for
example, to get à, one types AltGr-` followed by a.

I can't recommend this too much; I immediately dropped both the US-ASCII
and US-International keyboards, which I have been using in alternation.
The only (very minor) problem with it is that for some reason it messes
up Ctrl-Shift and Ctrl-nonletter key combinations.

-- 
"Well, I'm back."  --Sam        John Cowan <[EMAIL PROTECTED]>

Re: Folding algorithm and canonical equivalence

2004-07-17 Thread John Cowan

Asmus Freytag scripsit:

> John, you proposed the initial set. Do you have any suggestion here?

My original submission had only the single-character mappings, not the
character pair mappings, which are just the result of decomposing the
precomposed set and don't IMHO make much sense: they are too selective.

The list predates TR#30; I developed it for the purpose of
making NFC Latin text minimally legible on an old ASCII-only
printer.  (I simply changed the filtering regex from "LATIN" to
"LATIN|GREEK|CYRILLIC|HEBREW".)  It was not intended to cope with
partially or fully decomposed text.
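
For readers who want to see the general shape of such a list, here is a
hypothetical Python reconstruction.  It is not the script that produced the
original submission: it only assumes that the list can be approximated by
filtering character names with the regex above and keeping the base letter
of each canonically decomposable character.

```python
import re
import unicodedata

# The filtering regex mentioned in the message, applied to character names.
SCRIPTS = re.compile(r"LATIN|GREEK|CYRILLIC|HEBREW")


def single_char_mappings() -> dict:
    """Map each matching precomposed BMP character to its base letter."""
    table = {}
    for cp in range(0x80, 0x10000):
        ch = chr(cp)
        if not SCRIPTS.search(unicodedata.name(ch, "")):
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        # Keep only "letter plus nonspacing marks" decompositions.
        if (marks
                and unicodedata.category(base).startswith("L")
                and all(unicodedata.category(m) == "Mn" for m in marks)):
            table[ch] = base
    return table
```

A table built this way contains exactly the entries that become redundant
once the folding operates on decomposed text, because each of them is
recoverable from canonical decomposition alone.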

I agree that in the TR#30 context, the Right Thing is to remove the
character pair mappings altogether, and all of the single-character
mappings that have canonical decompositions.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
I come from under the hill, and under the hills and over the hills my paths
led. And through the air. I am he that walks unseen.  I am the clue-finder,
the web-cutter, the stinging fly. I was chosen for the lucky number.  --Bilbo

Re: Folding algorithm and canonical equivalence

2004-07-17 Thread John Cowan

Peter Kirk scripsit:

> But I think the best thing to do is to drop *all* Hebrew 
> combining marks; the result of this is valid unpointed Hebrew.  

I agree.

-- 
Schlingt dreifach einen Kreis vom dies!    John Cowan <[EMAIL PROTECTED]>
Schliesst euer Aug vor heiliger Schau, http://www.reutershealth.com  
Denn er genoss vom Honig-Tau,  http://www.ccil.org/~cowan  
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)

Re: Umlaut and Tréma, was: Variation selectors and vowel marks

2004-07-13 Thread John Cowan

Doug Ewell scripsit:

> CGJ + COMBINING DIAERESIS is a hack, but then again the need to draw a
> distinction between the exact same combining mark used for two different
> phonetic purposes is a bit of a hack too.

However, there used to be typographical distinctions in certain German
fonts between umlaut and diaeresis: see the examples on p. 15 of Victor
Gaultney's paper "Problems of diacritic design for Latin script text
faces" at http://www.sil.org/~gaultney/ProbsOfDiacDesignLowRes.pdf
(warning: 1.4M), particularly Figure 39.

> The alternative proposed by DIN, creating a new COMBINING UMLAUT
> character, would have caused *unprecedented and catastrophic*
> equivalence and normalization problems.

Indeed.

-- 
"Take two turkeys, one goose, four  John Cowan
cabbages, but no duck, and mix them http://www.ccil.org/~cowan
together. After one taste, you'll duck  [EMAIL PROTECTED]
soup the rest of your life."            http://www.reutershealth.com
--Groucho

Re: Looking for transcription or transliteration standards latin- >arabic

2004-07-09 Thread John Cowan

Peter Kirk scripsit:

> I have just reviewed this list and found it odd that Hebrew presentation 
> forms are included but Arabic ones are not. 

The specification actually called only for Latin, Greek, and Cyrillic;
I added Hebrew pour la lagniappe.  If someone wants to add Arabic, I
encourage them to do so.

> the Hebrew presentation forms but also most of the precomposed 
> characters are redundant in this list. 

True; however, the current list indicates the scope of what actually
happens, even if it is overlong.

> It is therefore
> necessary to list in the specification of the folding only all (?) 
> combining marks, which are to be deleted, 

I believe that all Mn-class characters, and only they, are deleted by this.

> I note that 0429 is not folded to 0428 etc, and this is correct because 
> within the Cyrillic writing system these are entirely separate 
> characters. But the difference between these two is in fact exactly the 
> same descender which is removed in 0496 etc.

I don't think that matters.  Long historical practice has made SHCHA a
separate letter, just as G, J, U, and W are now separate Latin letters
from C, I, V, and VV-ligature.

> I am also surprised to note 
> that no folding is given for 0419/0439; although in some ways this is 
> desirable because Russians do not consider this breve to be a diacritic 
> (and after all we would not want the dot on i to be removed as a 
> diacritic!), these characters have canonical decompositions to 0418/0438 
> and breve and the principle of canonical equivalence and the folding 
> algorithm (which works on decomposed characters) more or less demand 
> that the breve be deleted. Also 048A/048B should then fold to 0418/0438 
> rather than 0419/0439.

I think I agree with this: i-breve does not have the same universal status as
shch.
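
The asymmetry between the two letters can be checked against the character
database.  This is a small Python sketch using the standard unicodedata
module; canonical_decomposition is a made-up helper name.

```python
import unicodedata


def canonical_decomposition(ch: str) -> list:
    """Return the canonical decomposition of ch as a list, or [] if none."""
    raw = unicodedata.decomposition(ch)
    # Compatibility decompositions are tagged, e.g. "<compat> 0020 0301".
    if not raw or raw.startswith("<"):
        return []
    return [chr(int(cp, 16)) for cp in raw.split()]
```

U+0419 decomposes to U+0418 plus U+0306 COMBINING BREVE, which is category
Mn and is therefore deleted by a folding that works on decomposed text.
U+0429 has no decomposition at all, so nothing is removed from it.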

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
'Tis the Linux rebellion / Let coders take their place,
The Linux-nationale / Shall Microsoft outpace,
We can write better programs / Our CPUs won't stall,
So raise the penguin banner of / The Linux-nationale.

Re: Looking for transcription or transliteration standards latin- >arabic

2004-07-09 Thread John Cowan

Jony Rosenne scripsit:

> I doubt it makes much sense to the casual reader. Witness how nearly every
> radio and television pronounces New Delhi as New Del-hi.

O pity the poor poor Zippity,
For he can eat nothing but Greli,
   A plant that grows only
   In New Caledony,
While the Zippity lives in New Delhi.
--Shel Silverstein

-- 
"Take two turkeys, one goose, four      John Cowan
cabbages, but no duck, and mix them http://www.ccil.org/~cowan
together. After one taste, you'll duck  [EMAIL PROTECTED]
soup the rest of your life."            http://www.reutershealth.com
--Groucho

Re: Looking for transcription or transliteration standards latin- >arabic

2004-07-06 Thread John Cowan

Patrick Andries scripsit:

> >So the change is more like Beijing -> Peking than Berlin -> Kitchener. 
> 
> Without a political change Constantinople would not have changed name in 
> a matter of days (at least as far as the officials were concerned). In 
> any case, it is not a transliteration problem (Beijing --> Pékin).

Not just a transliteration problem, either:  Mandarin Chinese underwent
a sound-shift in the 17th century that changed the second syllable from
"ging" to "jing", but the English name was already set (and the change
did not affect Southern Sinitic in any case; cf. Cantonese "pak king").

In addition, when it isn't the capital (bei jing = "North-capital"),
i.e. 1928-49, its name is Beiping ("north-peace").

-- 
Here lies the Christian,        John Cowan
judge, and poet Peter,  http://www.reutershealth.com
Who broke the laws of God   http://www.ccil.org/~cowan
and man and metre.  [EMAIL PROTECTED]

Re: Looking for transcription or transliteration standards latin- >arabic

2004-07-06 Thread John Cowan

Peter Kirk scripsit:

> Well, did Gdansk/Danzig change its name backwards and forwards several 
> times over history (thank you, Qrczak, for the interesting information 
> about that), or was it simply that it had different names in different 
> languages?

Yes to both.  Its name in Polish is Gdan'sk, in German Danzig.  Which one is
the dominant name is determined by which power is dominant at a given time.
What foreigners call the city is influenced, though not determined, by
when the city first became important to them.

There is hardly a city in Europe that isn't like this.  What makes this
one special, though hardly unique, is the repeated changes of sovereignty.
Consider Strassburg/Strasbourg.

> This makes it not a transliteration problem but a translation 
> problem, one which is common to many geographical names - sometimes the 
> names in different languages are related, and sometimes they are not 
> e.g. Turku/Åbo in Finland, or Yerushalayim/al-Quds, or Dublin/(I'll let 
> Michael tell us the correct Irish form).

Baile Atha Cliath.  Dublin is also an Irish name, though used mostly by
Norse and English (and now by anglophone Irish, of course).

-- 
My confusion is rapidly waxing  John Cowan
For XML Schema's too taxing:    [EMAIL PROTECTED]
I'd use DTDs                    http://www.reutershealth.com
If they had local trees --  http://www.ccil.org/~cowan
I think I best switch to RELAX NG.

Re: is "n with tilde" used in French language ?

2004-07-04 Thread John Cowan

Stefan Persson scripsit:

> I have only seen ñ in old French; however, old French also uses tilde 
> above lots of other characters, such as all vowels (ã?õ?) and a 
> lot of consonants, e.g. q?? (for the old spelling of "que").  Instead of 
> writing an "n", you often put a tilde over the letter preceding the "n". 
> So e.g. "France" was "Frãce."  I believe that this spelling was used 
> until about the time of the French revolution.

In origin the tilde *was* a degenerate "n", of course.

-- 
John Cowan  [EMAIL PROTECTED]
http://www.ccil.org/~cowan  http://www.reutershealth.com
Thor Heyerdahl recounts his attempt to prove Rudyard Kipling's theory
that the mongoose first came to India on a raft from Polynesia.
--blurb for Rikki-Kon-Tiki-Tavi

Re: Looking for transcription or transliteration standards latin- >arabic

2004-07-04 Thread John Cowan

Philipp Reichmuth scripsit:

> "Chykoffskee" is pretty accurate, actually.

Thank you.  I have long since forgotten all the (very small amount of)
Russian I ever learned, but I retain a firm grip on its phonology due to
an interesting paedagogical device.  My Russian instructor spent the first
week or so of class teaching us to speak English with a Russian accent
(and this I can do to this day).  The idea was that having mastered this,
we could then begin to speak Russian as well with a Russian accent,
which is to say, perfectly.

> I'd say Tchaikovsky is just 
> a spelling taken over from French at a time when French was pretty much 
> the international common language at least in diplomacy and art.

Doubtless.  I have even seen it spelled in German fashion in English a
time or two.

-- 
I suggest you call for help,    John Cowan
or learn the difficult art of mud-breathing.    [EMAIL PROTECTED]
--Great-Souled Sam  http://www.ccil.org/~cowan

Re: Looking for transcription or transliteration standards latin- >arabic

2004-07-04 Thread John Cowan

Doug Ewell scripsit:

> On the contrary, untransliterated (or untranscribed) text can only be
> read by people who know the original script.  Transliterations and
> transcriptions at least give the Latin-script-only reader a fighting
> chance to pronounce the text.  

Transliterations don't work so well for that, but transliterating some
scripts to Latin is a necessity (for me, at least) to even recognize them.
I can cope with Greek, Hebrew, and Cyrillic, but an English text full
of Arabic or Chinese names presented in the usual scripts for those
languages would be hopeless -- I wouldn't be able to reliably tell one
name from another.

This is true even though I have no more Greek, Hebrew, or Russian than
I have Arabic or Chinese.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
"If he has seen farther than others,
it is because he is standing on a stack of dwarves."
--Mike Champion, describing Tim Berners-Lee (adapted)

Re: Looking for transcription or transliteration standards latin- >arabic

2004-07-02 Thread John Cowan

Jony Rosenne scripsit:
> Transcription does not require roundtrip. It is intended in this case for
> the English speaker to be able to deliver an approximate pronunciation
> adapted to his native vocal capabilities.

Except when it doesn't.  We write Tchaikovsky, not Chykoffskee.

-- 
"I could dance with you till the cows   John Cowan
come home.  On second thought, I'd  http://www.ccil.org/~cowan
rather dance with the cows when you http://www.reutershealth.com
came home."  --Rufus T. Firefly [EMAIL PROTECTED]

Re: Greek tonos and oxia

2004-06-30 Thread John Cowan

Peter Kirk scripsit:

> Since the characters are in fact exactly equivalent, you can use 
> whichever you wish, as long as you are aware that some processes may 
> change one to the other. They should be rendered identically.  

True.  But the original question was "Which are preferred", and there
is a definite answer to that.

> But, in favour of using the versions from the Extended Greek sets,
> there are a number of fonts around which render the versions in the
> main Greek and Coptic block (or has it been officially renamed just
> "Greek"?) with a vertical tonos,

Quite so.  In general, though, we should encode text correctly and then
use correct fonts, rather than adjusting our encoding to the vagaries
of erroneous or obsolete fonts.  Unicode 2.0 fonts also have the problem
that they produce the wrong forms for theta and phi in running text.

-- 
"In my last lifetime,   John Cowan
I believed in reincarnation;    http://www.ccil.org/~cowan
in this lifetime,   [EMAIL PROTECTED]
I don't."  --Thiagi http://www.reutershealth.com

Re: Greek tonos and oxia

2004-06-30 Thread John Cowan

Michael Everson scripsit:
> At 14:11 -0400 2004-06-30, John Cowan wrote:
> 
> >But the X WITH ACUTE characters there are exactly equivalent to the 
> >X WITH TONOS characters in the main Greek block, and the ones in the 
> >main Greek block are in fact preferred.
> 
> How can you tell they are preferred, John?

Because normalization changes the X WITH ACUTE characters to the X WITH TONOS
ones, as a result of the one-way nature of (singleton) canonical equivalence.

-- 
John Cowan  [EMAIL PROTECTED]  http://www.ccil.org/~cowan
Does anybody want any flotsam? / I've gotsam.
Does anybody want any jetsam? / I can getsam.
--Ogden Nash, No Doctors Today, Thank You

Re: Greek tonos and oxia

2004-06-30 Thread John Cowan

Peter Kirk scripsit:

> If you prefer to use precomposed characters (rather than separate 
> diacritics as Ken suggested) or need to do so to meet W3C 
> recommendations, you should use the ones in the Extended Greek section, 
> which allow for a distinction between acute and grave accents which is 
> important for Classical Greek.

Many of the characters in the Extended Greek block are indeed
essential to polytonic Greek.  But the X WITH ACUTE characters there
are exactly equivalent to the X WITH TONOS characters in the main Greek
block, and the ones in the main Greek block are in fact preferred.

This can be determined by looking at the normalization rules, which will
change all X WITH ACUTE characters to the corresponding X WITH TONOS
characters.
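
This is easy to verify with any normalizer.  A short Python sketch using the
standard unicodedata module, taking alpha as the sample letter (the constant
and function names are just labels for this example):

```python
import unicodedata

ALPHA_OXIA = "\u1f71"    # Greek Extended block
ALPHA_TONOS = "\u03ac"   # main Greek block


def normalized_forms(ch: str) -> dict:
    """Return ch under each of the four normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, ch) for form in forms}
```

Under NFC and NFKC the Greek Extended character comes out as the main-block
one, and the main-block one is left unchanged; under NFD and NFKD both
become alpha followed by U+0301.  No normalization form ever produces the
Greek Extended character.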

> You may like to look at Nick Nicholas' Greek Unicode site at 
> http://ptolemy.tlg.uci.edu/~opoudjis/unicode/unicode.html, which 
> discusses these issues.

Indeed.

-- 
"In my last lifetime,   John Cowan
I believed in reincarnation;    http://www.ccil.org/~cowan
in this lifetime,   [EMAIL PROTECTED]
I don't."  --Thiagi http://www.reutershealth.com

Re: decent unicode capable web app editor

2004-06-16 Thread John Cowan

Edward H. Trager scripsit:

> What about vim (vi clone: http://www.vim.org).  I just converted
> a very large UTF-8-encoded HTML document (more than 15000
> lines) to standards-compliant XHTML-1.0 and found the advanced
> regular-expression-based substitution facilities in vi(m) absolutely
> indispensible for adding all of the closing tags that XML requires
> which were missing in my original document.

HTML Tidy or TagSoup would probably have served you better, rather than
groveling over the code bit by bit.  (HTML Tidy can do more cleaning,
but it sometimes loops or delivers garbage if the HTML is sufficiently
broken.  TagSoup never gives up and never loops, but doesn't know
as much about HTML.)

-- 
Said Agatha Christie / To E. Philips Oppenheim  John Cowan
"Who is this Hemingway? / Who is this Proust?   [EMAIL PROTECTED]
Who is this Vladimir / Whatchamacallum, http://www.reutershealth.com
This neopostrealist / Rabble?" she groused. http://www.ccil.org/cowan
--author unknown to me; any suggestions?

Re: Bantu click letters

2004-06-10 Thread John Cowan

Michael Everson scripsit:

> Unless one contacted whomever it is who owns "Bantu Studies" and 
> simply *asked*.

Carfax (part of the Taylor and Francis Group).

Here's contact information:

Reprints, permissions + electronic rights   

Joanne Nerland
Taylor & Francis
PO Box 2562 Solli
N-0202 Oslo
Norway
+ 47 22 12 9880
or: +47 22 12 9884
Mobile: +47 90 11 3974
+47 22 12 9890

But Gutenberg may not care: they mostly (now exclusively?) publish texts
in the public domain.

-- 
John Cowan  http://www.ccil.org/~cowan  [EMAIL PROTECTED]
Please leave your values        Check your assumptions.  In fact,
   at the front desk.              check your assumptions at the door.
 --sign in Paris hotel           --Cordelia Vorkosigan

Re: Bantu click letters

2004-06-10 Thread John Cowan

Michael Everson scripsit:

> > > Effort and expense was made to cut the letters for the publication.
> >
> >And today, if I were reprinting it, I'd commission a digital font
> >(your effort, my expense) and put the characters in the PUA.
> 
> Not if you wanted, as an Africanist, to be able to represent the text 
> as it was originally written.

We must be talking past one another somehow, but I don't understand how.
To represent the text as originally written, I need a digital representation
for each of the characters in it.  Since all I want to do is reprint
the book -- I don't need to use the unusual characters in interchange --
the PUA and a commissioned font seem just perfect to me.

> You don't know whether or not they were only used in a single 
> document. You know only that I *own* that single document. You are 
> declaring the characters guilty until proved innocent. That's 
> antagonistic.

I intend no antagonism.

We treat the Phaistos-disk characters as guilty until proven innocent,
for the same reason -- there's only one text.  (It's also true that
we can't interpret them, which is additional evidence against them.)
There's no *point* in encoding the PD characters because they aren't
used in interchange -- see above.

> >If I decided to start using thorn instead of theta in my otherwise
> >IPA transcriptions, that would be an idiosyncratic use of it.
> 
> Plenty of Germanist transcriptions use thorn. In any case, the 
> analogy isn't relevant, as both thorn and theta are encoded and 
> available for use.

I was talking about what it means to be idiosyncratic.  (Not that
either of us need any real instruction on the subject!)

> >(LATIN LETTER OWL, indeed.)
> 
> COMBINING SEAGULL BELOW, indeed.

LATIN LETTER OI, indeed.  :-)

> [OWL] is interesting, by the way. Asmus says it's similar to 
> something the Japanese use for telephone answering machines. I don't 
> know about that, though it looks familiar to me. I wonder what Doke's 
> source for it was.

It looks to me the sort of thing that would be easy to reinvent.
Some of my habitual doodles are much like it.

> I was astonished because I hadn't seen them before. That does not 
> mean I didn't believe that they weren't worthy of encoding. Just 
> because I hadn't seen them before doesn't mean they don't exist and 
> aren't worthy of encoding either. Khoisan phonology is rather 
> esoteric, after all.

Sure.  I was addressing the question of the *novelty* of the characters.
If neither you nor I nor anyone else in this community has seen them
before, they are most certainly novel.

> I am gobsmacked. On what grounds are these not characters? They are 
> not glyph representations of other characters. 

They *are* characters.  It's just not useful to encode them, any more
than it's useful to encode most of the scripts in the Conscript Registry.

Find more documents, and the picture changes.  (Find more Phaistos-type
disks, and that picture changes too.)

-- 
If you have ever wondered if you are in hell, John Cowan
it has been said, then you are on a well-traveled http://www.ccil.org/~cowan
road of spiritual inquiry.  If you are absolutely   http://www.reutershealth.com
sure you are in hell, however, then you must be [EMAIL PROTECTED]
on the Cross Bronx Expressway.  --Alan Feuer, NYTimes, 2002-09-20

Re: Bantu click letters

2004-06-10 Thread John Cowan

Michael Everson scripsit:

> Although Pullum and Ladusaw 
> don't show the glyphs, they refer specifically to Doke's characters 
> (s.v. ///). They describe them as "ad hoc" which I suppose they were, 
> in 1925, though "novel" would do as well as they aren't entirely 
> arbitrary and they weren't "found" bits of lead type pressed into 
> other service -- they were cut to order.

If Sequoyah had had clout, we'd probably be using his original
characters for Cherokee today.

> That Pullum and Ladusaw have not forgotten Doke's characters suggests 
> that Africanists will also likely not forget them, and will find use 
> in access to them as encoded characters in the UCS.

It's P&L's business to remember what would otherwise be (mercifully,
in some cases) forgotten, so that people who need to interpret old
documents have some hope of doing so.

What we need is more evidence: either documentary evidence, or the
evidence of breathing Africanists.

-- 
John Cowan  <[EMAIL PROTECTED]>
http://www.ccil.org/~cowan  http://www.reutershealth.com
Charles li reis, nostre emperesdre magnes,
Set anz totz pleinz ad ested in Espagnes.

Re: Bantu click letters

2004-06-10 Thread John Cowan

Michael Everson scripsit:

> They were published in Bantu Studies in 1925 in an article by a 
> rather important scholar in the field of African linguistics.  

We don't encode characters according to the clout of the user, or
the Apple logo would have been in Unicode long since. :-)

> Effort and expense was made to cut the letters for the publication.

And today, if I were reprinting it, I'd commission a digital font
(your effort, my expense) and put the characters in the PUA.

> The sounds they represent are idiosyncratic and difficult to 
> describe, much less write. 

I think that characters used in a single document by a single scholar,
however prestigious, can fairly be described as idiosyncratic to him.
If I decided to start using thorn instead of theta in my otherwise
IPA transcriptions, that would be an idiosyncratic use of it.  If
instead I used OVERCLOCKED HOOCHIMADINGER SYMBOL, that would be
even more idiosyncratic.

(LATIN LETTER OWL, indeed.)

> Personal? No: he published.

Fair enough.

> Novel? Perhaps
> (in 1925); Doke is likely to have devised them. 

They are just as novel today as they were eighty years ago; I well
remember how astonished you and I were, looking over the text.

> Private use? Be
> serious, John. That's a pretty ridiculous suggestion.

I am serious.  The PUA is the proper place for these things.

-- 
"May the hair on your toes never fall out!" John Cowan
--Thorin Oakenshield (to Bilbo) [EMAIL PROTECTED]

Re: Bantu click letters

2004-06-10 Thread John Cowan

Michael Everson scripsit:

> Proposal to add Bantu phonetic click characters to the UCS
> http://www.evertype.com/standards/iso10646/pdf/n2790-clicks.pdf

[T]he Unicode Standard does not encode idiosyncratic,
personal, novel, or private use characters [...].

Whatever may have been done in the past, I don't think that one
document is enough to support the introduction of new Latin letters;
these look extremely idiosyncratic, personal, novel and private use
to me.

-- 
All Norstrilians knew what laughter was:    John Cowan
it was "pleasurable corrigible malfunction".    http://www.reutershealth.com
--Cordwainer Smith, Norstrilia  [EMAIL PROTECTED]

Re: Revised Phoenician proposal

2004-06-08 Thread John Cowan

Peter Constable scripsit:

> > > In that sense, treating Phoenician as a script variant of Hebrew
> > > is a big win for many of the users of the script, since they
> > > would have a hard time deciphering the bizarre (to them) script
> > > variant but have no problem reading texts originally written in
> > > it in different fonts.
> 
> I didn't understand that statement the first time round, and still am
> not sure I understand it. (The antecedent for the last occurrence of
> "it" isn't clear to me, so I'm having difficulty interpreting the whole
> thing, apart from the matter of whether the point makes sense.)

I interpret it to mean that if you know Hebrew, you can read text in
Old Hebrew or Phoenician or whatever, provided you can get past the
script barrier.  For such people, there is some advantage in encoding
these old texts with Hebrew characters, since a simple font change will
convert between the authentic and the intelligible.

By the same token, there would be some advantage to Croats wishing to
read Serbian if it's encoded in an encoding that can be rendered with
either Latin or Cyrillic letters (or digraphs); such a thing could
easily be constructed and mapped to Unicode, thanks to the Croat-specific
digraph compatibility characters present.  That wouldn't make such
an encoding a Good Thing in the wider world, though.

-- 
He made the Legislature meet at one-horse   John Cowan
tank-towns out in the alfalfa belt, so that [EMAIL PROTECTED]
hardly nobody could get there and most of   http://www.reutershealth.com
the leaders would stay home and let him go  http://www.ccil.org/~cowan
to work and do things as he pleased.--Mencken, Declaration of Independence

OT: John Cowan announces the "Unix Power Classic"

2004-06-03 Thread John Cowan

My apologies for this cross-post, but I just can't tell which of you
will be interested in my latest effort: the Unix Power Classic,
an evolving hacker-oriented version of the Tao Te Ching.
See http://www.ccil.org/~cowan/upc .

Please don't reply on-list, but directly to [EMAIL PROTECTED]  Thanks.

-- 
Schlingt dreifach einen Kreis vom dies!    John Cowan <[EMAIL PROTECTED]>
Schliesst euer Aug vor heiliger Schau, http://www.reutershealth.com  
Denn er genoss vom Honig-Tau,  http://www.ccil.org/~cowan  
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)

Re: Proposal to encode dominoes and other game symbols

2004-06-02 Thread John Cowan

Andrew C. West scripsit:

> And perhaps Michael would be kind enough to prepare a proposal for
> traffic signs if you asked nicely ;)

Yes, but it will be only four lines long.  :-)

> H.7 Some criteria weaken the case for encoding

A few of these criteria seem a bit flaky to me.

> There is evidence that
> -- the symbol is primarily used free-standing (traffic signs)
> -- the notational system is not widely used on computers (dance notation,
> traffic signs)

So it looks like there are at least two reasons to shoot down traffic
signs already.  OTOH, lots of things are not widely used on computers
precisely because they have no standard representation (minority scripts
being the obvious case.)

> -- the symbol is purely decorative

This would seem to exclude dingbats altogether.

> -- the identity of the symbol is usually ignored in processing

Eh?

> H.10 Perceived usefulness
> The fact that a symbol merely seems to be useful or potentially
> useful is precisely not a reason to code it. Demonstrated usage, or
> demonstrated demand, on the other hand, does constitute a good reason
> to encode the symbol.

Amen.

-- 
The Imperials are decadent, 300 pound   John Cowan <[EMAIL PROTECTED]>
free-range chickens (except they have   http://www.reutershealth.com
teeth, arms instead of wings and        http://www.ccil.org/~cowan
dinosaurlike tails).--Elyse Grasso

Phoenician character properties

2004-05-29 Thread John Cowan

If the Phoenician numbers work like Arabic digits (except for not being
positional decimal, of course), shouldn't they have bidi type AN?

Is strong RTLness really required for PHOENICIAN WORD SEPARATOR?  If not,
it can be unified with MIDDLE DOT.
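
For anyone wanting to compare the properties in question, here is a
rough Python sketch that queries bidi classes through the standard
unicodedata module.  The Phoenician code points are an assumption in the
sense that they are the ones assigned when the block was eventually
published, not anything fixed at the time of this message, and the
result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def bidi_class(ch):
    """Return the Bidi_Class short name ('AN', 'ON', 'R', ...), or '' if unassigned."""
    return unicodedata.bidirectional(ch)

samples = {
    "ARABIC-INDIC DIGIT ONE": "\u0661",
    "MIDDLE DOT": "\u00b7",
    # Code points below are those later assigned in the Phoenician block.
    "PHOENICIAN NUMBER ONE": "\U00010916",
    "PHOENICIAN WORD SEPARATOR": "\U0001091f",
}

for name, ch in samples.items():
    print(f"{name}: {bidi_class(ch) or '(not in this Unicode version)'}")
```

Arabic-Indic digits report AN and MIDDLE DOT reports ON, which is the
comparison the two questions above turn on.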

-- 
"Do I contradict myself?    John Cowan
Very well then, I contradict myself.[EMAIL PROTECTED]
I am large, I contain multitudes.   http://www.ccil.org/~cowan
--Walt Whitman, Leaves of Grass http://www.reutershealth.com

Re: [BULK] - Re: Vertical BIDI

2004-05-28 Thread John Cowan

Mark Davis scripsit:

> > As things now stand, Ogham must be wrapped in RLO...PDF brackets when
> > mixed with vertical Han or Mongolian.
> 
> Yes, that's true -- and I don't see any reason why people can't live with
> that... Those are the kinds of reasons we have the explicit controls.

Because horizontal vs. vertical should be under the control of a higher-level
protocol such as CSS.  In order for this to work properly, CSS has to
"reach down inside" the bidi algorithm and jimmy it; or, to put it
another way, the character-level representation of Ogham has to have
knowledge of what overall directionality it is going to be imbedded in.
Both are Bad Things.

-- 
There are three kinds of people in the world:   John Cowan
those who can count,    http://www.reutershealth.com
and those who can't.    [EMAIL PROTECTED]

An annoying ambiguity about which nothing can be done now

2004-05-28 Thread John Cowan

The phrase "COMBINING DOUBLE" in a Unicode character can mean either
that the diacritical mark is doubled with respect to some other mark
(DOUBLE ACUTE, DOUBLE VERTICAL LINE ABOVE, DOUBLE GRAVE, DOUBLE LOW LINE,
DOUBLE OVERLINE, DOUBLE VERTICAL LINE BELOW, DOUBLE VERTICAL STROKE
OVERLAY) or else that it extends over two characters (DOUBLE BREVE,
DOUBLE MACRON, DOUBLE MACRON BELOW, DOUBLE TILDE, DOUBLE INVERTED BREVE,
DOUBLE RIGHTWARDS ARROW BELOW).  Of course MUSICAL SYMBOL COMBINING DOUBLE
TONGUE is something else again.
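
The names can't tell the two senses apart, but the canonical combining
class can: classes 233 (double below) and 234 (double above) are the
ones used for marks that span two base characters.  A small Python
sketch, with a made-up helper name double_kind, assuming only the
standard unicodedata module:

```python
import unicodedata

def double_kind(ch):
    """Classify a character whose name contains COMBINING DOUBLE."""
    name = unicodedata.name(ch, "")
    if "COMBINING DOUBLE" not in name:
        return None
    if not name.startswith("COMBINING DOUBLE "):
        return "something else again"      # e.g. the musical symbol
    if unicodedata.combining(ch) in (233, 234):
        return "spans two base characters"
    return "doubled mark on one base character"

for n in ("COMBINING DOUBLE ACUTE ACCENT",
          "COMBINING DOUBLE TILDE",
          "MUSICAL SYMBOL COMBINING DOUBLE TONGUE"):
    print(n, "->", double_kind(unicodedata.lookup(n)))
```

This is a heuristic over the character database, not a property the
standard defines under that name.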

Thank you.  I feel much better now.

-- 
John Cowan  www.ccil.org/~cowan  www.reutershealth.com  [EMAIL PROTECTED]
There are books that are at once excellent and boring.  Those that at
once leap to the mind are Thoreau's Walden, Emerson's Essays, George
Eliot's Adam Bede, and Landor's Dialogues.  --Somerset Maugham

Re: [BULK] - Re: Vertical BIDI

2004-05-27 Thread John Cowan

Mark Davis scripsit:

> What the Bidi Algorithm says about both of these is at:
> 
> http://www.unicode.org/reports/tr9/#Vertical_Text

However, it does not specify the treatment of Ogham embedded in TTB
text, since Ogham is the only script with both a required horizontal
direction (LTR) and a required vertical one (BTT).

As things now stand, Ogham must be wrapped in RLO...PDF brackets when
mixed with vertical Han or Mongolian.
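
In plain text the wrapping amounts to bracketing the run with U+202E and
U+202C.  A minimal Python sketch; the helper name wrap_rlo and the
sample letters are made up for illustration:

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def wrap_rlo(text):
    """Force right-to-left ordering of text by bracketing it with RLO ... PDF."""
    return f"{RLO}{text}{PDF}"

# Three Ogham letters (BEITH, LUIS, FEARN), chosen arbitrarily.
ogham = "\u1681\u1682\u1683"
wrapped = wrap_rlo(ogham)
print([f"U+{ord(c):04X}" for c in wrapped])
```

The override changes only the display order; the stored order of the
Ogham letters is untouched.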

-- 
One art / There is  John Cowan <[EMAIL PROTECTED]>
No less / No more   http://www.reutershealth.com
All things / To do  http://www.ccil.org/~cowan
With sparks / Galore -- Douglas Hofstadter

Re: Phoenician, Fraktur etc

2004-05-27 Thread John Cowan

Mark E. Shoulson scripsit:

> Just for some more confusion to add, I note that with the distaste later 
> Pharisaic Judaism had for the Old Hebrew script, there comes a fairly 
> well-accepted, if unsupportable, thesis that the Law was actually 
> *originally* given in Square Hebrew ("Assyrian Script"), which was then 
> changed/forgotten when Israel sinned, and later still restored.  See 
> http://www.sacred-texts.com/jud/t08/t0805.htm for some Talmudic 
> discussion of the matter.

It's interesting, though, that the story given first is the true one:
first Hebrew with PH script, then Aramaic with Square script; finally
the Jews wind up with the law in Square Hebrew and the Samaritans
with the law in Aramaic using PH script.

-- 
John Cowan  www.ccil.org/~cowan  www.reutershealth.com  [EMAIL PROTECTED]
"'My young friend, if you do not now, immediately and instantly, pull
as hard as ever you can, it is my opinion that your acquaintance in the
large-pattern leather ulster' (and by this he meant the Crocodile) 'will
jerk you into yonder limpid stream before you can say Jack Robinson.'"
--the Bi-Coloured-Python-Rock-Snake

Re: Character Foldings

2004-05-26 Thread John Cowan

Mark Davis scripsit:

> >LATIN CAPITAL LETTER O WITH STROKE => O + /

That road quickly takes you to RFC 1345; I have some code for cracking
1345's format, and I should be able to prepare a mapping table fairly
soon.
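
To see why a hand-built table is needed at all: O WITH STROKE has no
decomposition, so normalization leaves it alone.  A small Python sketch
of a table-driven folding; the table entries and the name fold are
illustrative only and are not taken from RFC 1345's tables.

```python
import unicodedata

# Hypothetical folding table: letter -> base letter plus ASCII solidus.
STROKE_FOLDS = {
    "\u00d8": "O/",  # LATIN CAPITAL LETTER O WITH STROKE
    "\u00f8": "o/",  # LATIN SMALL LETTER O WITH STROKE
}

def fold(text):
    """Replace each character found in the table; pass everything else through."""
    return "".join(STROKE_FOLDS.get(ch, ch) for ch in text)

# NFD does not touch U+00D8, which is why the table is required.
print(unicodedata.normalize("NFD", "\u00d8") == "\u00d8")  # True
print(fold("\u00d8re"))                                     # O/re
```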

-- 
My confusion is rapidly waxing  John Cowan
For XML Schema's too taxing:[EMAIL PROTECTED]
I'd use DTDs                    http://www.reutershealth.com
If they had local trees --  http://www.ccil.org/~cowan
I think I best switch to RELAX NG.

Re: Proposal to encode dominoes and other game symbols

2004-05-25 Thread John Cowan

Eric Muller scripsit:

> >A suggestion for playing cards: why not including the "Tarots"?
> >I mean in French the 4 "Cavaliers" figures, the 18 "Atouts", and the 
> >"Excuse"
> >(which is not exactly a Joker); sorry I don't have their English names.
> >
> Make that 21 atouts (labeled "1" through "21"), for a total of 78 cards. 
> The "cavalier" is between the jack and the queen. Very popular game in 
> high school and college in my days.

"Trumps" in English.  I suggest that 21 trumps be encoded, but not
named, because the correspondence of names to numbers is variable.
Suitable names would be PLAYING CARD TRUMP I through PLAYING CARD TRUMP XXI.
The Fool, the 22nd or un-numbered trump, is the direct ancestor of
the Joker and should be unified with it.

The fourth court card in each suit is called the Knight in English;
these should also be encoded.

This would call for 25 more cards.  Documentation on the Tarot is
extensive and readily available.

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
'Tis the Linux rebellion / Let coders take their place,
The Linux-nationale / Shall Microsoft outpace,
We can write better programs / Our CPUs won't stall,
So raise the penguin banner of / The Linux-nationale.

Re: VISCII (was: Re: [BULK] - Re: MCW encoding of Hebrew)

2004-05-25 Thread John Cowan

Doug Ewell scripsit:

> > So is [VIQR] a 7-bit encoding, or a scheme layered on top of ASCII?
> 
> It's a scheme layered on top of ASCII.
> 
> > And what is KOI-7?
> 
> A true 7-bit encoding for Russian, in which Cyrillic letters (small and
> capital respectively) were encoded in the ranges where ASCII has Latin
> letters (capital and small respectively).

Ah.  And on what principle do you distinguish them?  The IETF clearly
treats them both as charsets, within its definitions.
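
The case-swapped arrangement Doug describes can be seen with Python's
koi8_r codec.  This is a sketch under an assumption: that KOI-7 put its
Cyrillic letters at the positions KOI8-R has them once the high bit is
cleared.  The helper name strip_high_bit is made up.

```python
def strip_high_bit(text):
    """Encode as KOI8-R, clear bit 7 of every byte, and read the result as ASCII."""
    return bytes(b & 0x7F for b in text.encode("koi8_r")).decode("ascii")

# Capital Cyrillic lands on small Latin, small Cyrillic on capital Latin.
print(strip_high_bit("Русский"))  # rUSSKIJ
```

Note that the 7-bit result really is a different repertoire occupying
the ASCII letter positions, whereas VIQR leaves ASCII as ASCII and
assigns meaning to sequences.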

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
"You cannot enter here.  Go back to the abyss prepared for you!  Go back!
Fall into the nothingness that awaits you and your Master.  Go!" --Gandalf

Re: VISCII (was: Re: [BULK] - Re: MCW encoding of Hebrew)

2004-05-24 Thread John Cowan

Doug Ewell scripsit:

> Truye^.n cu?a o^ng la` nhu+~ng bo^. nho+' ghi la.i mo^.t ca'ch so^'ng
> ddo^.ng nhu+~ng sinh hoa.t dda(.c bie^.t cu?a no^ng tho^n Vie^.t Nam
> ca'ch dda^y nu+?a the^' ky?.  Ta ye^u me^'n da^n to^.c ta\.

So is this a 7-bit encoding, or a scheme layered on top of ASCII?  And
what is KOI-7?
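
For readers who want to see the sample above as ordinary Vietnamese,
here is a rough Python sketch of a decoder.  It is an approximation
written from the conventions visible in the sample (^ ( + as vowel
modifiers, ' ` ? ~ . as tone marks, dd for d-with-stroke, backslash as
escape), not an implementation of the VIQR specification; the function
name is made up.

```python
import unicodedata

MODIFIERS = {"^": "\u0302", "(": "\u0306", "+": "\u031b"}   # circumflex, breve, horn
TONES = {"'": "\u0301", "`": "\u0300", "?": "\u0309",
         "~": "\u0303", ".": "\u0323"}                      # acute, grave, hook, tilde, dot below
VOWELS = set("aeiouyAEIOUY")

def viqr_to_unicode(text):
    """Turn VIQR-style ASCII into precomposed (NFC) Vietnamese text."""
    out, state, i = [], "other", 0      # state: "vowel", "modified" or "other"
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):      # backslash: next char is literal
            out.append(text[i + 1]); state = "other"; i += 2
            continue
        if text[i:i + 2] in ("dd", "DD", "Dd"):   # d with stroke
            out.append("\u0111" if ch == "d" else "\u0110"); state = "other"; i += 2
            continue
        if ch in MODIFIERS and state == "vowel":
            out.append(MODIFIERS[ch]); state = "modified"
        elif ch in TONES and state in ("vowel", "modified"):
            out.append(TONES[ch]); state = "other"   # at most one tone per vowel
        else:
            out.append(ch); state = "vowel" if ch in VOWELS else "other"
        i += 1
    return unicodedata.normalize("NFC", "".join(out))

print(viqr_to_unicode("Vie^.t Nam"))   # Việt Nam
print(viqr_to_unicode("ta\\."))        # ta.
```

The escaped full stop at the end of the sample shows why the scheme is
layered: every byte is still plain ASCII, and the reader decides what
the sequences mean.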

-- 
"Clear?  Huh!  Why a four-year-old child    John Cowan
could understand this report.  Run out  [EMAIL PROTECTED]
and find me a four-year-old child.  I   http://www.ccil.org/~cowan
can't make head or tail out of it." http://www.reutershealth.com
--Rufus T. Firefly on government reports

Re: Response to Everson Phoenician and why June 7?

2004-05-24 Thread John Cowan

James Kass scripsit:

> Well, I don't think it would be cavalier in any sense to use a 
> transliteration font.  Hardly antiquarian or throwback, either.
> 
> But, I don't for a minute think it's the proper thing to do.
> I think it would be silly and churlish.  

I'm more of a ceorl than a chevalier, myself.  Strictly foot-bound
peasant stock.

> those who wish to do so aren't bound by my opinions, eh?

The widespread use (as opposed to the mere existence) of a Phoenician
encoding in Unicode imposes costs on at least some Semiticists that
they do not wish to pay, at least without some assistance from Unicode.
Hence my desire to have Phoenician and Hebrew collate together at the
first level (more for searching than for sorting).
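
As a rough illustration of what first-level equivalence buys for
searching, here is a Python sketch that folds Phoenician letters (and
Hebrew final forms) onto the basic Hebrew letters before comparing.
Everything here is an assumption for illustration: the Phoenician code
points are the ones assigned when the block was later published, the
name search_key is made up, and a real solution would be a collation
tailoring rather than a character fold.

```python
# The 22 non-final Hebrew letters, alef through tav, in alphabetical order.
HEBREW = [0x5D0, 0x5D1, 0x5D2, 0x5D3, 0x5D4, 0x5D5, 0x5D6, 0x5D7, 0x5D8, 0x5D9,
          0x5DB, 0x5DC, 0x5DE, 0x5E0, 0x5E1, 0x5E2, 0x5E4, 0x5E6, 0x5E7, 0x5E8,
          0x5E9, 0x5EA]
PHOENICIAN_ALF = 0x10900  # first of 22 letters in the later-assigned block

# Phoenician letter n -> Hebrew letter n; Hebrew final forms -> base letters.
FOLD = {PHOENICIAN_ALF + n: cp for n, cp in enumerate(HEBREW)}
FOLD.update({0x5DA: 0x5DB, 0x5DD: 0x5DE, 0x5DF: 0x5E0, 0x5E3: 0x5E4, 0x5E5: 0x5E6})

def search_key(text):
    """Key under which Phoenician and Hebrew spellings of a word compare equal."""
    return text.translate(FOLD)

phoenician_mlk = "\U0001090c\U0001090b\U0001090a"   # mem, lamd, kaf
hebrew_mlk = "\u05de\u05dc\u05da"                   # mem, lamed, final kaf
print(search_key(phoenician_mlk) == search_key(hebrew_mlk))  # True
```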

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
The Penguin shall hunt and devour all that is crufty, gnarly and
bogacious; all code which wriggles like spaghetti, or is infested with
blighting creatures, or is bound by grave and perilous Licences shall it
capture.  And in capturing shall it replicate, and in replicating shall
it document, and in documentation shall it bring freedom, serenity and
most cool froodiness to the earth and all who code therein.  --Gospel of Tux

Re: Response to Everson Phoenician and why June 7?

2004-05-24 Thread John Cowan

Dean Snyder scripsit:

> It would be like testing readers of Roman German who had
> never read Fraktur - they wouldn't recognize it as a font change either
> (which it is, of course, in Unicode).

I see the words "The New York Times" in Fraktur (more or less) every day.
It's obviously a font variant of Latin.

-- 
Business before pleasure, if not too bloomering long before.
--Nicholas van Rijn
John Cowan <[EMAIL PROTECTED]>
http://www.ccil.org/~cowan  http://www.reutershealth.com

Re: ISO 15924 draft fixes

2004-05-21 Thread John Cowan

Philippe Verdy scripsit:

> > Please go to Langues'O for this commentary. As I wrote, you will be
> > probably answered with the historical context.
> 
> C'est quoi Langues'O ? Où est-ce ?

Please forgive me for intruding into an internal francophone matter, but
whenever I see "Langues'O", my mind insists on correcting it into
"Langues d'O", as in "Histoire d'O".  Not that I read French.

-- 
John Cowan  [EMAIL PROTECTED]  http://www.reutershealth.com
"Not to know The Smiths is not to know K.X.U."  --K.X.U.

Re: Response to Everson Phoenician and why June 7?

2004-05-20 Thread John Cowan

Kenneth Whistler scripsit:

> The question is rather, given the fundamental nature of the
> Unicode Standard as enabling text processing for modern
> software, it is cost-effective and *reasonable* to provide
> a Unicode encoding for one particular script or another,
> unencoded to date, so as to maximize the chances that it
> will be handled more easily by modern software in the global
> infrastructure and to minimize the costs associated with
> doing so.

These words (and indeed your entire posting) deserve to be written
up in letters of gold somewhere.

-- 
LEAR: Dost thou call me fool, boy?  John Cowan
FOOL: All thy other titles  http://www.ccil.org/~cowan
 thou hast given away:  [EMAIL PROTECTED]
  That thou wast born with. http://www.reutershealth.com

Re: Vertical BIDI

2004-05-19 Thread John Cowan

Andrew C. West scripsit:

> The only thing that is certain is that Ogham must be rendered BTT in
> vertical contexts. For Ogham text in isolation this is fairly easy to
> accomplish by simple rotation, and one could expect "writing-mode
> : bt-rl" or "writing-mode : bt-lr" to accomplish this in a CSS
> stylesheet. Whether the columns should run LTR or RTL across the page
> is another question, although LTR would be simplest to implement as
> it would only involve rotating a whole block of horizontal LTR Ogham
> text 90 degrees anticlockwise. At any rate, vertical presentation is
> a matter for a higher protocol, and not a Unicode matter.

I think it's clear by now that bt-lr is the Right Thing.  (A great pity
that the Irish monks didn't record horizontal Ogham RTL!  If you are
standing in front of an Ogham-inscribed archway, the curve of the text
does pass from your right side to your left side (and the same for a
standing stone if you in imagination flatten out the sides), and the
monks must have had *some* familiarity with Hebrew or Arabic.)

> However, Ogham text embedded in Mongolian may be a different matter. If
> a plain text editor renders everything horizontally, as most do, then
> both Mongolian and Ogham should be rendered LTR thus  mongolian>, but if you then select vertical presentation (assuming
> your text editor has this option) Mongolian should be rendered TTB and
> Ogham BTT thus .  I still have no idea as
> to how this should be achieved. My "hack" of using a custom rotated
> Ogham font and RLO/PDF codes would achieve the desired result for
> vertical presentation, but would make the Ogham text RTL for horizontal
> presentation, which is apparently unacceptable. But what alternatives
> are there ?

To introduce a concept of bidi override into stylesheet languages.
You need something like this anyway to handle the case of lr Latin
with embedded Han, where the Latin reads BTT and the Han reads TTB.

Fundamentally, vertical scripts like Han and Mongolian and Ogham have
an essential vertical directionality and a preferred horizontal one
(but they can sometimes tolerate the other direction: RTL Han is not
unknown).  Horizontal scripts have an essential horizontal directionality
and may or may not have a preferred vertical one.

-- 
Long-short-short, long-short-short / Dactyls in dimeter,
Verse form with choriambs / (Masculine rhyme):  [EMAIL PROTECTED]
One sentence (two stanzas) / Hexasyllabically   http://www.reutershealth.com
Challenges poets who / Don't have the time. --robison who's at texas dot net

Re: Vertical BIDI

2004-05-19 Thread John Cowan

Philippe Verdy scripsit:

> > In fact no; both Mongolian (or Manchu, which is unified with it in
> > Unicode) and Chinese are written TTB.
> 
> Then, why did you say that:
> 
> > What's uncertain is whether a lr or a rl progression is favored,
> > given the paucity of evidence.  Michael favors lr progression.
> > There is no question that the text is read BTT.

That statement refers to Ogham, not Mongolian!

Ogham carved on stone is read up one side of the stone, then (if
necessary) across the top of the stone, then (if necessary) down the
other side of the stone.  Now maybe it's just a mistake to assimilate
this scheme to any kind of two-dimensional layout, since all known
instances of Ogham on manuscript are ordinary horizontal L2R, like Latin
(with which it is most often mixed).

The difficulty arises when Ogham is mixed with vertical Han or with
Mongolian, since once the basic directionality becomes vertical, the
tendency to read the Ogham BTT will become automatic.  This is analogous
to the problem that fantasai has pointed out with Latin script written
in lr progression when Han gets mixed in: the normal reading direction
of lr-Latin is BTT, but any Han included will automatically be read TTB,
corrupting it.

*sigh*

One of my favorite lines in the Unicode Standard reads:  "There simply
is no traditional Japanese method of typesetting Devanagari."

-- 
John Cowan  www.ccil.org/~cowan  www.reutershealth.com  [EMAIL PROTECTED]
There are books that are at once excellent and boring.  Those that at
once leap to the mind are Thoreau's Walden, Emerson's Essays, George
Eliot's Adam Bede, and Landor's Dialogues.  --Somerset Maugham

Re: Vertical BIDI

2004-05-18 Thread John Cowan

Philippe Verdy scripsit:

> This creates an interesting problem: Put in the same sentence Han
> (Chinese) and Mongolian words in a vertical layout (I don't think this
> is unlikely, as Mongolian is also spoken in China, and there's also
> a Chinese community in Mongolia). So Chinese ideographs will be laid
> out vertically from top to bottom (but not rotated, except for a few
> characters like ideographic punctuation marks or symbols), and Mongolian
> will be laid out from bottom to top in their normal stack orientation.

In fact no; both Mongolian (or Manchu, which is unified with it in
Unicode) and Chinese are written TTB.  When Mongolian stands alone, the
columns progress from left to right, but when it's mixed with Han, the
columns progress from right to left, as is the case with Chinese alone.
Presumably this is about like writing a Latin-script language with upright
glyphs and LTR, but progressing from the bottom of the page to the top:
annoying but legible.

> Now admit that you want to present it horizontally: Han ideographs will
> not be rotated but will flow on rows from left to right. Suppose you
> have performed the Bidi processing according to the previous vertical
> presentation, then Mongolian stacks will flow from right to left
> (but unlike Han ideographs, they will need to be rotated...)

You don't.  Horizontal Mongolian runs left to right, which means that
with respect to its Aramaic ancestor the glyphs are upside down.

Now when mixing Ogham vertical text with other vertical scripts, you
do indeed need to use RLO ... PDF to force it into bidirectional
behavior, but it's the only such case.

There seem to be two different alternatives for RTL horizontal alphabets
in a vertical context, depending on which way the glyphs are rotated.

-- 
But you, Wormtongue, you have done what you could for your true master.  Some
reward you have earned at least.  Yet Saruman is apt to overlook his bargains.
I should advise you to go quickly and remind him, lest he forget your faithful
service.  --Gandalf John Cowan <[EMAIL PROTECTED]>

Re: ISO-15924 script nodes and UAX#24 script IDs

2004-05-18 Thread John Cowan

Antoine Leca scripsit:

> OTOH, it appears to me (feel free to contradict me, and also to
> point me the epoch when these things did change) that English habits
> now is to follow the native name and the transliteration rules.

True, although diacritics are still sometimes dropped not on principle but
as a concession (no longer necessary, IMHO) to typographical constraints.
Exactly when this began to be so is vague: sometime in the 20th century,
or perhaps the very last of the 19th.  Certainly at the beginning of
the 19th century anglophones were still respelling foreign words.

> A good example I found recently is the name of Cervantes' main work,
> which short name is "Don Quixote" in English, the same as it was
> in (original) Castilian, while at the same time it was adapted in
> French as "Don Quichotte" (same pronunciation as original), and
> similarly in today's Castilian "Don Quijote" (with subsequent change
> in pronunciation.) I do not know how English natives will pronounce
> it, however.

Most people say [kihote], since we do not have the Spanish "j", IPA [x].
(Of course the [o] and [e] vowels become diphthongs, as in most varieties
of English.)  I personally made a mild nuisance of myself in the class
where I studied it by insisting on saying [kiSote].  The derived adjective
"quixotic", however, is pronounced in native fashion [kwIksOtIk].

The English poet Byron did not hesitate in his 1821 poem about Don Juan
(Tenorio, that is) to rhyme the hero's name with "new one" and "true
one" in the very first stanza, showing that the pronunciation [EMAIL PROTECTED]
was normal in his time.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Rather than making ill-conceived suggestions for improvement based on
uninformed guesses about established conventions in a field of study with
which familiarity is limited, it is sometimes better to stick to merely
observing the usage and listening to the explanations offered, inserting
only questions as needed to fill in gaps in understanding. --Peter Constable

Re: Vertical BIDI

2004-05-17 Thread John Cowan

fantasai scripsit:

> If another style rule changes the block progression to rl, what should
> happen to the Ogham? Should it now go top to bottom?

It should not.  That's what makes Ogham different from standard
horizontal scripts -- it does have a preferred vertical orientation,
and because turning it upside-down generates different *characters*,
you can't violate that.

> >Also, it's not just punctuation marks that need to get vertical glyphs
> >in vertical formats, it's also things like BOPOMOFO LETTER I.
> 
> Are you sure you're not confusing that with the KATAKANA-HIRAGANA
> PROLONGED SOUND MARK?

Not sure, but I had understood that bopomofo i (which is just one stroke)
was rotated when vertical.

-- 
My corporate data's a mess! John Cowan
It's all semi-structured, no less.  http://www.ccil.org/~cowan
But I'll be carefree                [EMAIL PROTECTED]
Using XSLT  http://www.reutershealth.com
On an XML DBMS.

Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-17 Thread John Cowan

Andrew C. West scripsit:

> Thus, if "tb-lr" were supported, your browser would display the
> following HTML line as vertical Mongolian with embedded Ogham reading
> top-to-bottom, but in a plain text editor, the Mongolian and Ogham
> would both read LTR, and everyone would be happy :

I don't know about that.  I wouldn't be too happy trying to read English
with the Latin letters laid out bt-rl and lying on their left sides to boot.
On paper is one thing, but on a non-rotatable screen?  I don't think so.

-- 
"We are lost, lost.  No name, no business, no Precious, nothing.  Only empty.
Only hungry: yes, we are hungry.  A few little fishes, nassty bony little
fishes, for a poor creature, and they say death.  So wise they are; so just,
so very just."  --Gollum  [EMAIL PROTECTED]  www.ccil.org/~cowan

Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-15 Thread John Cowan

Chris Jacobs scripsit:

> So if people pronounce it as
> 
> twenty-one
> esriem we achad
> 
> then they probably indeed write the digit 2 first.

Indeed, but the difficulty is that various Arabic colloquials don't
agree on the order of pronouncing numbers -- and modern standard
Arabic uses the least-significant-digit first style: one and twenty
and three hundred and 

-- 
John Cowan   www.reutershealth.com   www.ccil.org/~cowan   [EMAIL PROTECTED]
Lope de Vega: "It wonders me I can speak at all.  Some caitiff rogue did
rudely yerk me on the knob, wherefrom my wits still wander."
An Englishman: "Ay, a filchman to the nab betimes 'll leave a man  
crank for a spell." --Harry Turtledove, Ruled Britannia

Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-15 Thread John Cowan

Jony Rosenne scripsit:

> However, in Hebrew and Arabic, numbers are written left to right and so are
> Latin and other LTR script quotations. So RTL really means mixed direction,
> and the bidi algorithm is there to handle it automatically with little user
> intervention.

BTW, Peter Daniels told me viva voce that arabophones, like persophones and
hebraeophones, do (hand)write numbers LTR starting with the most significant
digit.  But we still have no confirmation from a native arabophone.

And if someone could explain the full significance of the Arabic-Indic
vs. the Eastern Arabic-Indic digits (other than glyph shape), I'd
appreciate it.  I know that the EAI digits work just like the European ones,
whereas the AI digits work differently, but what is the effective difference?

> All of this is completely irrelevant to boustrophedon and vertical scripts.
> These are presentation issues that have no need for Unicode support.

Vertical Ogham does, but forced override is sufficient -- it doesn't need
an *implicit* bidi algorithm.

-- 
John Cowan  [EMAIL PROTECTED]
http://www.reutershealth.com  http://www.ccil.org/~cowan
Humpty Dump Dublin squeaks through his norse
Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
Humpty Dump Dublin's grandada of all rogues.  --Cousin James

Re: Interleaved collation of related scripts (was: Phoenician)

2004-05-13 Thread John Cowan

Peter Kirk scripsit:

> >I would have just as many objections to doing that as I would with 
> >unifying it with Hebrew. Users don't expect this kind of interfiling 
> >when looking things up in ordered lists. Interfiling of scripts 
> >impedes legibility.
> 
> Well, I see the point. But presumably the only people who would collate 
> a text containing a mixture of Hebrew and Phoenician, for example, are 
> those who know and understand both scripts. For anyone else this is a 
> matter of garbage in, garbage out. So it should be up to these users to 
> decide whether the legibility concern, which is a real one, is more 
> important than their otherwise expressed preference for interfiling.

In addition, it's important to always remember that "collation" is a
cover term for both sorting *and* searching.  Collating Hebrew with
"Phoenician" at the first level means that a search using Hebrew
letters will find "Phoenician" text as well.

(I am using horror quotes to remind people that Unicode "Phoenician"
includes many non-Punic 22CWSAs, particularly Palaeo-Hebrew.)

If indeed Serbs prefer collation equivalence between Cyrillic and
Latin (which can only be a tailored preference, of course; in general
we don't want to do that), this means not only that they will see
the two interfiled in a sorted list, but also that searching for a
Serbian word in Cyrillic will find it in Latin and vice versa.
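
The sorting-and-searching point can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the weight table is
invented for the example (it is not taken from any real collation table),
and the names WEIGHTS, sort_key, primary_key and search are hypothetical.
Each letter gets a primary weight shared across the two scripts and a
secondary weight that tells the scripts apart.

```python
# Hypothetical two-level weights: (primary, secondary).
# The primary weight is shared by a Latin letter and its Cyrillic
# counterpart; the secondary weight distinguishes the scripts.
# These numbers are made up for illustration only.
WEIGHTS = {
    "a": (1, 0), "\u0430": (1, 1),  # LATIN a / CYRILLIC a
    "b": (2, 0), "\u0431": (2, 1),  # LATIN b / CYRILLIC be
    "d": (3, 0), "\u0434": (3, 1),  # LATIN d / CYRILLIC de
    "o": (4, 0), "\u043E": (4, 1),  # LATIN o / CYRILLIC o
    "v": (5, 0), "\u0432": (5, 1),  # LATIN v / CYRILLIC ve
}


def primary_key(word):
    """Key built from primary weights only (script differences ignored)."""
    return tuple(WEIGHTS[ch][0] for ch in word)


def sort_key(word):
    """Full key: all primaries first, then secondaries as a tie-breaker."""
    return (primary_key(word), tuple(WEIGHTS[ch][1] for ch in word))


def search(needle, words):
    """Primary-strength search: match every word with the same primary key."""
    target = primary_key(needle)
    return [w for w in words if primary_key(w) == target]


latin_voda = "voda"
cyrillic_voda = "\u0432\u043E\u0434\u0430"
latin_baba = "baba"
cyrillic_baba = "\u0431\u0430\u0431\u0430"

words = [latin_voda, cyrillic_baba, cyrillic_voda, latin_baba]

# Sorting interfiles the two scripts: each word sits next to its twin.
ordered = sorted(words, key=sort_key)

# Searching in Cyrillic also finds the Latin spelling.
hits = search(cyrillic_voda, words)
```

With this table, the sorted list interfiles each Latin word with its
Cyrillic twin, and a search for the Cyrillic spelling returns both
spellings.  The same mechanism is what would make a Hebrew-letter search
find "Phoenician" text if the two were given equal primary weights.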

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan
Female celebrity stalker, on a hot morning in Cairo:
"Imagine, Colonel Lawrence, ninety-two already!"
El Auruns's reply:  "Many happy returns of the day!"

Re: interleaved ordering (was RE: Phoenician)

2004-05-12 Thread John Cowan

Philippe Verdy scripsit:

> Full collation between Phoenician and Hebrew is not really needed:
> the texts are part of separate corpus, and the original documents
> do not mix these scripts in the same words. 

Remember that "Phoenician" in this context includes Palaeo-Hebrew, and
we *have* seen evidence that this script is mixed with Square in the
same text, though not in the same word.

-- 
Evolutionary psychology is the theory   John Cowan
that men are nothing but horn-dogs, http://www.ccil.org/~cowan
and that women only want them for their money.  http://www.reutershealth.com
--Susan McCarthy (adapted)  [EMAIL PROTECTED]

Re: OT [was TR35]

2004-05-12 Thread John Cowan

John Hudson scripsit:
> Jony Rosenne wrote:
> 
> >Mozilla's main value is for non-Windows platforms.
> 
> And for people who are unimpressed by Outlook's security track record.

The main reason I spoke of the Outlook addiction is that (at least as of
the last time I looked at the question) it is practically impossible
to get one's data (saved emails, saved calendar entries, etc.) out of
the Outlook database in usable form.  In particular, emails with
attachments are practically beyond reconstruction.

Mozilla-based email systems use plain mbox/Eudora format, which at least
maintains the emails in a way that's easy to understand.

Me, I use mutt.  GUI-based mail clients are just too slow.

-- 
John Cowan   [EMAIL PROTECTED]  http://www.ccil.org/~cowan
Most languages are dramatically underdescribed, and at least one is 
dramatically overdescribed.  Still other languages are simultaneously 
overdescribed and underdescribed.  Welsh pertains to the third category.
--Alan King

Re: Phoenician

2004-05-11 Thread John Cowan

Christopher Fynn scripsit:

> OTOH applications that generate collated lists should ideally provide a
> straightforward means of applying special tailoring tables.

"Should ideally" are the operative words; in most cases, we're lucky
if we get default collating behavior rather than UTF-16 or UTF-8/UTF-32
binary sorting.  That's why it's important what the content of the
default collation is, and that it get things right for at least a
large subset of users.
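
As a small illustration of why the two kinds of binary sorting are
distinguished above, here is a Python sketch (the three sample characters
are an arbitrary choice for the example).  It sorts the same strings by
their encoded bytes in each encoding form; UTF-8 and UTF-32 agree with
code point order, while UTF-16 puts supplementary characters (encoded as
surrogate pairs starting at 0xD800) ahead of BMP characters from U+E000
up.

```python
# Three sample characters, chosen only for illustration:
#   U+0041  LATIN CAPITAL LETTER A
#   U+FF61  HALFWIDTH IDEOGRAPHIC FULL STOP (BMP, above the surrogate range)
#   U+10000 LINEAR B SYLLABLE B008 A (first supplementary code point)
chars = ["\uFF61", "\U00010000", "A"]

# Binary sort on the encoded bytes of each encoding form.
# Big-endian variants are used so that byte order matches code unit order.
by_utf8 = sorted(chars, key=lambda s: s.encode("utf-8"))
by_utf32 = sorted(chars, key=lambda s: s.encode("utf-32-be"))
by_utf16 = sorted(chars, key=lambda s: s.encode("utf-16-be"))

# Code point order, for comparison.
by_code_point = sorted(chars, key=ord)

print([f"U+{ord(c):04X}" for c in by_utf8])
print([f"U+{ord(c):04X}" for c in by_utf32])
print([f"U+{ord(c):04X}" for c in by_utf16])
```

Neither order is what a user would call sorted, of course; the sketch
only shows that "binary sorting" is not even a single well-defined order
across encoding forms.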

-- 
Overhead, without any fuss, the stars were going out.
--Arthur C. Clarke, "The Nine Billion Names of God"
John Cowan <[EMAIL PROTECTED]>

Re: OT [was TR35]

2004-05-11 Thread John Cowan

Jony Rosenne scripsit:

> When I travel, I change the time rather than the time zone, because
> changing the time zone causes Outlook to mess up my calendar. This
> causes my e-mails to have a wrong time stamp. Is there any solution
> to this?

AFAIK the only cure is to break the Outlook addiction.

-- 
I suggest you call for help,    John Cowan
or learn the difficult art of mud-breathing.    [EMAIL PROTECTED]
--Great-Souled Sam  http://www.ccil.org/~cowan

Everson-bashing (was: Phoenician)

2004-05-10 Thread John Cowan

Peter Kirk scripsit:

> But have the others agreed with his judgments because they are convinced 
> of their correctness? Or is it more that the others have trusted the 
> judgments of the one they consider to be an expert, and have either not 
> dared to stand up to him or have simply been unqualified to do so?  

This is laughable.

> It amazes me that all of the existing scripts have apparently been encoded 
> without any properly documented justification apart from one expert's 
> unchallenged judgments.

It would be amazing if it were true, but of course it's absolutely false.

> And these two cases are hardly a good advertisement for the expert's
> reputation. The Coptic/Greek unification proved to be ill-advised and is
> being undone. As for the unified W and Q, well, I guess that if the
> Kurds and others who use these letters in Cyrillic knew how this
> decision would mean that their alphabet will never be sorted correctly
> (unless they get round to tailoring their collations), they would make a
> strongly argued case for disunification. 

Nobody writes Kurdish in Cyrillic any more: it's a historic use of the
script only.

In any event, Michael had *nothing* to do with those unifications.
He has consistently pressed for disunification (rightly, IMHO).

> Well, perhaps the expert can
> feel how much his fingers have been burned by over-unification and so is
> now pressing for everything to be disunified.

Nonsense, and insulting nonsense to boot.  Michael has never pressed
for either total unification or total disunification, because both
positions are absurd, and his position is never absurd.  (I may
disagree with it from time to time, and I am willing to press him for
reasons, but I *always* respect his point of view.)

This verbal sniping on a subject (the history of character encoding)
you know nothing about is beneath you.  Try and do better.

> And then there is the matter of CJK unification, which I gather is still
> rather contentious.

Only among the invincibly ignorant.

-- 
John Cowan   <[EMAIL PROTECTED]>   http://www.ccil.org/~cowan
"One time I called in to the central system and started working on a big
thick 'sed' and 'awk' heavy duty data bashing script.  One of the geologists
came by, looked over my shoulder and said 'Oh, that happens to me too.
Try hanging up and phoning in again.'"  --Beverly Erlebacher

Re: Phoenician

2004-05-08 Thread John Cowan

E. Keown scripsit:

> I guess this is a flame, right?
> but what on earth does it mean?
> 
> > Hardly.  If the rest of you hadn't agreed with his
> > judgments most of the time, the Roadmap might look 
> > quite different.  It's more like Potter
> > Stewart on pornography.
> 
> Who's Potter Stewart?  (I don't own a TV).    Elaine

*lol*

A former Associate Justice of the U.S. Supreme Court, who memorably
declared in a 1964 concurring opinion that he could not define
pornography, but he knew it when he saw it (and the movie in
question wasn't it).

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
"It's the old, old story.  Droid meets droid.  Droid becomes chameleon. 
Droid loses chameleon, chameleon becomes blob, droid gets blob back
again.  It's a classic tale."  --Kryten, Red Dwarf

Re: Phoenician

2004-05-08 Thread John Cowan

Mark Davis scripsit:

> - But I'm good at it, because invariably when I say it's a tree,
> I agree with myself.

Hardly.  If the rest of you hadn't agreed with his judgments most of the
time, the Roadmap might look quite different.  It's more like Potter
Stewart on pornography.

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
The Penguin shall hunt and devour all that is crufty, gnarly and
bogacious; all code which wriggles like spaghetti, or is infested with
blighting creatures, or is bound by grave and perilous Licences shall it
capture.  And in capturing shall it replicate, and in replicating shall
it document, and in documentation shall it bring freedom, serenity and
most cool froodiness to the earth and all who code therein.  --Gospel of Tux

Re: Default Ordering (was: Re: Phoenician)

2004-05-08 Thread John Cowan

Kenneth Whistler scripsit:

> (Encoded as distinct scripts, by the
> way, despite their clear and evident historic relationship
> to each other, and despite the fact that Japanese can obviously
> read both of them with great facility -- if you guys want to
> take that particular bone in your mouth and chew on it for
> awhile... consider Kana the 48CEAS *hehe*) 

Of course they would have to be.  But if the Japanese had ditched their
kanji and wrote mostly in hiragana, with katakana used very rarely --
say, about as frequent in running text as italicized foreign words in
Latin-script running text -- they might not have bothered to encode
them separately.

> If it turns out to make the most sense for a default table
> to have 22CWSA scripts (as John puts it) sort with interleaved
> primary weights, it is technically feasible to generate a
> table that way. (Although not for Hebrew versus Arabic versus
> Syriac, which are treated distinctly for primary weights now.)

Oh, I quite agree.  Arabic and Syriac are out of the picture here:
too many consonants, too different.

> It isn't a foregone conclusion what the UTC and WG2 will do on
> this issue -- it, like the encoding of the Phoenician
> (~ Old Canaanite, ~ Old West Semitic) script itself, is a
> matter for technical debate and decision.

Which we are now having the preliminary part of.

-- 
You escaped them by the will-death  John Cowan
and the Way of the Black Wheel. [EMAIL PROTECTED]
I could not.  --Great-Souled Sam    http://www.ccil.org/~cowan

Re: Fraser (was RE: Public Review Issues Updated)

2004-05-08 Thread John Cowan

Kenneth Whistler scripsit:

> Fraser is to Latin approximately as Tangut is to Han. It is what
> you get when you create a de novo script for a completely
> different language, but you have a very limited notion of
> what a "letter" is supposed to look like.

I'm sold.  Separate script it is.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
If a traveler were informed that such a man [as Lord John Russell] was
leader of the House of Commons, he may well begin to comprehend how the
Egyptians worshiped an insect.  --Benjamin Disraeli
