Re: Roundtripping in Unicode
Mike Ayers scripsit:

> I thought that URLs were specified to be in Unicode. Am I mistaken?

You are. URLs are specified to be in *ASCII*. There is a %-encoding hack that allows you to represent random-octet filenames as ASCII. Some people (including me) think it's a good idea to use this hack to specify non-ASCII characters with double encoding (first as UTF-8, then with the %-hack), but the URI Syntax RFC doesn't say.

-- 
John Cowan  [EMAIL PROTECTED]  http://www.reutershealth.com  http://www.ccil.org/~cowan
Humpty Dump Dublin squeaks through his norse
Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
Humpty Dump Dublin's grandada of all rogues.  --Cousin James
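The double encoding mentioned in this message (UTF-8 first, then the %-hack) can be sketched with Python's standard `urllib.parse`. The sample strings are invented for illustration, and the choice of UTF-8 is the convention the message argues for, not something the URI Syntax RFC requires.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Non-ASCII text: encode as UTF-8, then %-escape every resulting octet.
escaped = quote("caf\u00e9", encoding="utf-8")
print(escaped)  # caf%C3%A9

# Decoding recovers the text only if the receiver also assumes UTF-8.
restored = unquote(escaped, encoding="utf-8")

# The %-hack itself works on octets, so it can also carry bytes
# that are not valid UTF-8 at all (a "random-octet" filename).
raw = unquote_to_bytes("name%FF")
```

The result is pure ASCII on the wire; which character encoding the escaped octets belong to remains an agreement between sender and receiver.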
Re: Roundtripping in Unicode
Peter Kirk scripsit:

> I think the problem here is that a Unix filename is a string of octets,
> not of characters. And so it should not be converted into another
> encoding form as if it is characters; it should be processed at a quite
> different level of interpretation.

Unfortunately, that is simply a counsel of perfection. Unix filenames are in general input as character strings, output as character strings, and intended to be perceived as character strings. The corner cases in which this does not work are not sufficient to overthrow the power and generality to be achieved by assuming it 99% of the time.

(A private correspondent has come up with an ingenious trick which depends on being able to create files named 0x08 and 0x7F, but it truly is a trick, and in any case depends only on an ASCII interpretation.)

-- 
John Cowan  [EMAIL PROTECTED]
Income tax, if I may be pardoned for saying so, is a tax on income.  --Lord Macnaghten (1901)
Re: RE: Roundtripping in Unicode
Doug Ewell scripsit:

> "When faced with [an] ill-formed code unit sequence while transforming
> or interpreting text, a conformant process must treat the first code
> unit... as an illegally terminated code unit sequence -- for example, by
> signaling an error, filtering the code unit out, or representing the
> code unit with a marker such as U+FFFD REPLACEMENT CHARACTER."

Plan 9, the original all-UTF-8 environment (it was translated in a single day from Latin-1 to UTF-8), represents ill-formed code unit sequences with the otherwise useless U+0080, on the grounds that an ill-formed code is semantically different from an untranslatable character, which is the purpose of U+FFFD.

-- 
John Cowan  http://www.ccil.org/~cowan  [EMAIL PROTECTED]  http://www.reutershealth.com
LEAR: Dost thou call me fool, boy?
FOOL: All thy other titles thou hast given away: That thou wast born with.
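The two marker conventions contrasted in this message can be compared side by side in Python. The sketch below uses the built-in `replace` handler for U+FFFD and registers a hypothetical handler, named `plan9-style` here, that substitutes U+0080 one byte at a time. The handler name and the per-byte granularity are assumptions made for illustration, not a description of Plan 9's actual code.

```python
import codecs

def plan9_style(exc):
    # Hypothetical handler: emit U+0080 for one offending byte, then resume
    # decoding at the next byte.
    if isinstance(exc, UnicodeDecodeError):
        return ("\u0080", exc.start + 1)
    raise exc

codecs.register_error("plan9-style", plan9_style)

data = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8

with_fffd = data.decode("utf-8", errors="replace")       # marker is U+FFFD
with_0080 = data.decode("utf-8", errors="plan9-style")   # marker is U+0080

try:
    data.decode("utf-8")  # the third option: signal an error
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Either marker is lossy: once decoded, the original byte value cannot be recovered from the marker alone.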
Re: Nicest UTF
Lars Kristan scripsit:

> > I'm using ISO-8859-2.
> In fact you're lucky. Many ISO-8859-1 filenames display correctly in
> ISO-8859-2. Not all users are so lucky.

It was a design point of ISO-8859-{1,2,3,4}, but not any other variants, that every character appears either at the same codepoint or not at all.

-- 
John Cowan  [EMAIL PROTECTED]  http://www.ccil.org/~cowan  http://www.reutershealth.com
At times of peril or dubitation,
Perform swift circular ambulation,
With loud and high-pitched ululation.
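The design point stated here can be checked mechanically against Python's bundled codec tables. The helper names `upper_half` and `conflicts` are made up for this sketch, and the result depends on the codec tables shipped with Python rather than on the ISO documents themselves.

```python
def upper_half(encoding):
    """Map each character in the 0xA0-0xFF range of an encoding to its byte value."""
    table = {}
    for byte in range(0xA0, 0x100):
        try:
            table[bytes([byte]).decode(encoding)] = byte
        except UnicodeDecodeError:
            continue  # position is unassigned in this part
    return table

def conflicts(enc_a, enc_b):
    """Characters present in both encodings but at different byte values."""
    a, b = upper_half(enc_a), upper_half(enc_b)
    return sorted(ch for ch in a.keys() & b.keys() if a[ch] != b[ch])

parts = ["iso8859_1", "iso8859_2", "iso8859_3", "iso8859_4"]
for i, first in enumerate(parts):
    for second in parts[i + 1:]:
        print(first, second, conflicts(first, second))
```

An empty list for a pair means every shared character sits at the same byte value, which is why a Latin-1 filename made of shared characters displays correctly under Latin-2.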
Re: Please RSVP... (was: US-ASCII)
Philippe Verdy scripsit:

> Didn't know that. Is this a very recent use?

It's been used as an English verb, adjective, and noun for 30-40 years and perhaps much longer: see below.

> In France, I think that RSVP was introduced and widely used at end of
> telegraphic messages (that contained lots of conventional acronyms), it
> survived at the time of telex, but now it is renewed with SMS messages on
> cellular phones, but is rarely used in emails.
>
> May be this was introduced in English at the old time of telegraphs as a
> useful abbreviation, but with a different meaning when it is used as a
> verb for saying "reply as requested"?

As far as I know, they were first used in formal invitations (to weddings, funerals, dances, etc.) in the corner of the card, as both shorter and more fancy than the older phrase "The favor of your reply is requested". Later came the "RSVP card", a small card included with the invitation for the invitee to respond with. "An RSVP" of course means "a reply to an invitation marked 'RSVP'."

-- 
John Cowan  http://www.ccil.org/~cowan  [EMAIL PROTECTED]  http://www.reutershealth.com
My corporate data's a mess!
It's all semi-structured, no less.
But I'll be carefree
Using XSLT
On an XML DBMS.
Re: Nicest UTF
Philippe Verdy scripsit:

> And I disagree with you about the fact the U+0000 can't be used in XML
> documents. It can be used in URI through URI escaping mechanism, as
> explicitly indicated in the XML specification...

You have a hold of the right stick but at the wrong end. U+0000 can be encoded in a URI as %00, but that does not mean that the IRIs in system ids and namespace names (and potentially other places) can contain explicit U+0000 characters or &#0; escapes either. Both of those are illegal, and documents that contain them are not well-formed. In character content and attribute values, U+0000 is not possible.

> And the fact that the various character productions, that are normally
> normative, have been changed so often, sometimes through erratas that
> were forgotten in the text of the next edition of the standard,

Do you have evidence for this claim?

> The only thing about which I can agree is that XML will forbid surrogates
> and U+FFFE and U+FFFF, but I won't say that a XML parser that does not
> reject NULs or other non-characters or "disallowed" C0 controls is so
> much buggy.

You are of course entitled to your uninformed opinion.

> But all these is also a proof that XML documents are definitely NOT
> plain-text documents, so you can't use Unicode encoding rules at the
> encoded XML document level, only at the finest plain-text nodes (these
> are the levels that the productions in the XML standard are trying, with
> more or less success, to standardize).

You can't blindly do *normalization* of XML documents as if they were plain text. *Encoding* XML documents according to Unicode is of course possible and desirable.

> As a consequence any process that blindly applies a plain-text
> normalization to a complete XML document is bogous, because it breaks the
> most basic XML conformance, i.e. the core document structure...

In one extraordinarily unlikely case, yes: the appearance of a combining overlay slash following the ">" that closes a tag will damage the document if it is NFC-normalized.

-- 
John Cowan  http://www.reutershealth.com  http://www.ccil.org/~cowan  [EMAIL PROTECTED]
You are a child of the universe no less than the trees and all other acyclic
graphs; you have a right to be here.  --DeXiderata by Sean McGrath
Re: Nicest UTF
Philippe Verdy scripsit:

> >Okay, I'm confused. Does ≮ open a tag? Does it matter if it's
> >composed or decomposed?
>
> It does not open a XML tag.
> It does matter if it's composed (won't open a tag) or decomposed (will
> open a tag, but with a combining character, invalid as an identifier
> start)

Let's be precise here. If the 7-character character sequence "&#8814;" appears in an XML document, it never opens a tag and it is never changed by normalization. If the 1-character sequence consisting of a single U+226E appears in an XML document, and that document is put through NF(K)D, it will become not well-formed. However, NF(K)D is not recommended for XML documents, which should be in NFC.

-- 
John Cowan  http://www.reutershealth.com  http://www.ccil.org/~cowan  [EMAIL PROTECTED]
First known example of political correctness:
"After Nurhachi had united all the other Jurchen tribes under the leadership
of the Manchus, his successor Abahai (1592-1643) issued an order that the
name Jurchen should be banned, and from then on, they were all to be called
Manchus."  --S. Robert Ramsey, The Languages of China
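The three cases distinguished in this message can be reproduced with Python's `unicodedata` module. This is only a sketch of the normalization behavior; the tiny XML fragment is invented, and no XML parser is involved.

```python
import unicodedata

# Case 1: the numeric character reference is seven ASCII characters,
# so normalization leaves it alone.
ncr = "&#8814;"
ncr_after_nfd = unicodedata.normalize("NFD", ncr)

# Case 2: a literal U+226E NOT LESS-THAN decomposes under NFD into
# "<" followed by U+0338 COMBINING LONG SOLIDUS OVERLAY.
decomposed = unicodedata.normalize("NFD", "\u226e")

# Case 3 (the NFC hazard): a U+0338 right after the ">" that closes a tag
# composes with it into U+226F NOT GREATER-THAN, destroying the tag close.
fragment = "<a>\u0338text</a>"
composed = unicodedata.normalize("NFC", fragment)
```

In case 2 the decomposed text now contains a bare "<", which is what makes the document not well-formed; in case 3 the ">" disappears into a precomposed character.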
Re: Nicest UTF
Philippe Verdy scripsit:

> If you look at the XML 1.0 Second Edition

The Second Edition has been superseded by the Third.

> Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
> [#x10000-#x10FFFF]

That is normative.

> But the comment following it specifies:

That comment is not normative and not meant to be precise.

> the restrictive
> definition of "Char" above also includes the whole range of C1 controls

By oversight.

> (#x80..#x9F), so I can't understand why the Char definition is so
> restrictive on controls; in addition the definition of Char also
> *includes* many non-characters (it only excludes surrogates, and U+FFFE
> and U+FFFF, but forgets to exclude U+1FFFE and U+1FFFF, U+2FFFE and
> U+2FFFF, ..., U+10FFFE and U+10FFFF).

By oversight again.

> Note however that nearly all XML parsers don't seem to honor this
> constraint (like SGML parsers...)!

Please specify the parsers that do and don't honor this. Any which don't honor it are buggy, and any documents which exploit those bugs are not XML.

> What is even worse is that XML 1.1 now reallows NUL for system
> identifiers and URIs, through escaping mechanisms.

Not true. U+0000 is absolutely excluded in both XML 1.0 and XML 1.1.

-- 
John Cowan  http://www.ccil.org/~cowan  http://www.reutershealth.com  [EMAIL PROTECTED]
"I could dance with you till the cows come home. On second thought, I'd
rather dance with the cows when you came home."  --Rufus T. Firefly
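The XML 1.0 Char production quoted in this message translates directly into a range check. The function name `is_xml10_char` is an invention for this sketch; the ranges are the ones in the production.

```python
def is_xml10_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

samples = {
    "NUL": "\x00",            # excluded
    "TAB": "\t",              # allowed
    "C1 control": "\x85",     # allowed (the oversight discussed above)
    "U+FFFE": "\ufffe",       # excluded
    "U+1FFFE": "\U0001fffe",  # allowed, although a noncharacter
}
for label, ch in samples.items():
    print(label, is_xml10_char(ch))
```

The sample set mirrors the points raised in the thread: C1 controls and the plane-end noncharacters pass, while U+0000, the surrogates, U+FFFE and U+FFFF do not.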
Re: Nicest UTF
Marcin 'Qrczak' Kowalczyk scripsit:

> http://www.w3.org/TR/2000/REC-xml-20001006#charsets
> implies that the appropriate level for parsing XML is code points.

You are reading the XML Recommendation incorrectly. It is not defined in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of characters. XML processors are required to process UTF-8 and UTF-16, and may process other character encodings or not. But the internal model is that of characters. Thus surrogate code points are not allowed.

-- 
John Cowan  www.reutershealth.com  www.ccil.org/~cowan  [EMAIL PROTECTED]
Arise, you prisoners of Windows / Arise, you slaves of Redmond, Wash,
The day and hour soon are coming / When all the IT folks say "Gosh!"
It isn't from a clever lawsuit / That Windowsland will finally fall,
But thousands writing open source code / Like mice who nibble through a wall.
        --The Linux-nationale by Greg Baker
Re: US-ASCII (was: Re: Invalid UTF-8 sequences)
Kenneth Whistler scripsit:

> On the other hand, for many English speakers, "RSVP" is simply
> learned as an unanalyzed verb, pronounced "aressveepee", meaning
> "send a response to this message". And to castigate such speakers
> for politely prepending a "please" to that verb is a little
> too much, don't you think?

It's also pervasive in English: SALT talks, OPEC countries (or nations), Mississippi River, Gobi Desert.

-- 
John Cowan  http://www.ccil.org/~cowan  http://www.reutershealth.com  [EMAIL PROTECTED]
"[T]he Unicode Standard does not encode idiosyncratic, personal, novel,
or private use characters, nor does it encode logos or graphics."
Re: Nicest UTF
Marcin 'Qrczak' Kowalczyk scripsit:

> > The XML/HTML core syntax is defined with fixed behavior of some
> > individual characters like '&', '<', quotation marks, and with special
> > behavior for spaces.
>
> The point is: what "characters" mean in this sentence. Code points?
> Combining character sequences? Something else?

Neither. Unicode characters.

-- 
John Cowan  [EMAIL PROTECTED]
"May the hair on your toes never fall out!"  --Thorin Oakenshield (to Bilbo)
Re: Nicest UTF
Marcin 'Qrczak' Kowalczyk scripsit:

> String equality in a programming language should not treat composed
> and decomposed forms as equal. Not this level of abstraction.

Well, that assumes that there's a special "string equality" predicate, as distinct from just having various predicates that DWIM. In a Unicode Lisp implementation, e.g., equal might be char-by-char equality and equalp might not.

> They are supposed to be equivalent when they are actual characters.
> What if they are numeric character references? Should "&#8814;"
> (7 characters) represent a valid plain-text character or be a broken
> opening tag?

It's a broken opening tag.

> Note that if it's a valid plain-text character, it's impossible
> to represent isolated combining code points in XML,

It's problematic to represent the *specific* combining code point when it appears immediately after a tag.

-- 
John Cowan  [EMAIL PROTECTED]
Don't be so humble. You're not that great.  --Golda Meir
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler scripsit:

> A Sybase ASE database has the same behavior running on Windows as
> running on Sun Solaris or Linux, for that matter.

Fair enough.

> UNIX filenames are just one instance of this.

However, although they are *technically* octet sequences, they are *functionally* character strings. That's the issue.

> Failing that, then BINARY fields *are* the appropriate
> way to deal with arbitrary arrays of bytes that cannot
> be interpreted as characters.

This is purism. All the filenames on my Unix system, for example, can be interpreted as character strings; the potential to create filenames that can't be is unutilized, and sensibly so. For that matter, the potential to create files containing C0 controls is also unutilized.

> > in the same way that it would
> > be overkill to encode all 8-bit strings in XML using Base-64
> > just because some of them may contain control characters that are
> > illegal in well-formed XML.
>
> Dunno about the XML issue here -- you're the expert on what
> the expected level of illegality in usage is there.

XML's policy is zero tolerance, both for illegal encodings and for illegal characters such as U+0001. So in order to be *100% sure* that a character string (ASCII, Latin-1, or UTF-*, it matters not) can be put into an XML document, one must treat it as binary and encode it as such, using QP or Base64 or what have you. But nobody does.

XML 1.1 allows the representation of every Unicode character except U+0000, which materially reduces the problem, but there is little support for XML 1.1 as yet.

In any case, this case is only an analogy, not an exact equivalent: the problem of representing illegal *characters* in an XML document is closely analogous to the problem of representing illegal *bytes* in a character string.

> The point I'm making is that *whatever* you do, you are still
> asking for implementers to obey some convention on conversion
> failures for corrupt, uninterpretable character data.
>
> My assessment is that you'd have no better success at making
> this work universally well with some set of 128 magic bullet
> corruption pills on Plane 14 than you have with the
> existing Quoted-Unprintable as a convention.

It doesn't have to work universally; indeed, it becomes a QOI issue. Allocating representations of bytes with "bits that are high" makes it possible to do something recoverable, at very little expense to the Unicode Consortium.

> Further, as it turns out that Lars is actually asking for
> "standardizing" corrupt UTF-8, a notion that isn't going to
> fly even two feet, I think the whole idea is going to be
> a complete non-starter.

I agree that that part won't fly, absolutely.

-- 
John Cowan  <[EMAIL PROTECTED]>  http://www.ccil.org/~cowan
In politics, obedience and support are the same thing.  --Hannah Arendt
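For comparison, Python ships a convention with the same goal of recoverability: the `surrogateescape` error handler maps each undecodable byte 0x80-0xFF to a lone surrogate U+DC80-U+DCFF. That is a different mechanism from the 128 allocated code points discussed in this thread, and it is shown here only as an illustration of what "something recoverable" can look like. The filename is made up.

```python
# A Latin-1 filename; 0xE9 followed by "." is not well-formed UTF-8.
name_bytes = b"caf\xe9.txt"

# Decode for display or processing, smuggling the bad byte through as U+DCE9.
as_text = name_bytes.decode("utf-8", errors="surrogateescape")

# Encode again with the same handler to get the original octets back.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

# Without the handler, the smuggled code point refuses to encode,
# so it cannot leak into interchange as "corrupt UTF-8".
try:
    as_text.encode("utf-8")
    leaked = True
except UnicodeEncodeError:
    leaked = False
```

Well-formed names pass through unchanged, so only the rare ill-formed name pays any cost, which is the property the message argues for.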
Re: OpenType not for Open Communication?
Michael Everson scripsit:

> >I think it's more accurate to say that you need to find a way to
> >compensate the font developer for his effort; this need not involve
> >money.
> >I, for example, create programs and give them to people for a reward I
> >consider sufficient; professionally, I write bespoke software which is
> >not useful to anyone but my employer; some 90% of all software written
> >is in this class.
>
> Read: "my employer pays me to make software".

Yes, of course; I thought that was implicit. My point is that not all the software made is paid for, not by a long chalk, and most of what is made is not sold to anyone. Neither of these things is true of fonts in any but the most trivial ways -- yet.

-- 
John Cowan  <[EMAIL PROTECTED]>  http://www.reutershealth.com  http://www.ccil.org/~cowan
I amar prestar aen, han mathon ne nen,
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, LOTR:FOTR
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler scripsit:

> Storage of UNIX filenames on Windows databases, for example,
> can be done with BINARY fields, which correctly capture the
> identity of them as what they are: an unconvertible array of
> byte values, not a convertible string in some particular
> code page.

This solution, however, is overkill, in the same way that it would be overkill to encode all 8-bit strings in XML using Base-64 just because some of them may contain control characters that are illegal in well-formed XML.

> In my opinion, trying to do that with a set of encoded characters
> (these 128 or something else) is *less* likely to solve the
> problem than using some visible markup convention instead.

The trouble with the visible markup, or even the PUA, is that "well-formed filenames", those which are interpretable as UTF-8 text, must also be encoded so as to be sure any markup or PUA that naturally appears in the filename is escaped properly. This is essentially the Quoted-Printable encoding, which is quite rightly known to those stuck with it as "Quoted-Unprintable".

> Simply
> encoding 128 characters in the Unicode Standard ostensibly to
> serve this purpose is no guarantee whatsoever that anyone would
> actually implement and support them in the universal way you
> envision, any more than they might a "=93", "=94" convention.

Why not, when it's so easy to do so? And they'd be *there*, reserved, unassignable for actual character encoding. Plane E would be a plausible location.

-- 
John Cowan  <[EMAIL PROTECTED]>  http://www.reutershealth.com  http://www.ccil.org/~cowan
I amar prestar aen, han mathon ne nen,
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, LOTR:FOTR
Re: Word dividers, was: proposals I wrote (and also, didn't write)
Peter Kirk scripsit:

> I notice that Elaine is here proposing a HEBREW SAMARITAN PUNCTUATION
> WORD DIVIDER - and this should be in the BMP as Samaritan is a script in
> modern list. But there is already in the pipeline a PHOENICIAN WORD
> SEPARATOR, provisionally U+1091F, and already defined U+10101 AEGEAN
> WORD SEPARATOR DOT, and also of course U+00B7 MIDDLE DOT. The glyphs for
> all of these seem indistinguishable, and so are the functions. The only
> difference seems to be the scripts they are associated with, but
> punctuation marks are supposed to be not tied to individual scripts.

Well, some are and some aren't. Arabic ? is definitely tied to Arabic, for example. As usual, Unicode is empirical rather than rational.

In any case, MIDDLE DOT, despite its official classification as punctuation, requires special treatment because of its use in Catalan orthography as effectively a modifier letter, so it is not useful to unify it with anything else. (It is already canonically equivalent to GREEK ANO TELEIA, which is regrettable.)

> Is there really a need for so many almost identical word divider dots?

Probably not. We already have gobs of dots. It's one of those things: on the other hand, Unicode unifies all the Indic dandas, for example.

-- 
John Cowan  <[EMAIL PROTECTED]>
But you, Wormtongue, you have done what you could for your true master. Some
reward you have earned at least. Yet Saruman is apt to overlook his bargains.
I should advise you to go quickly and remind him, lest he forget your faithful
service.  --Gandalf
Re: OpenType not for Open Communication?
John Hudson scripsit:

> OpenType is a trademark of Microsoft and a proprietary font format
> jointly developed by Microsoft and Adobe.

The question is, is it an open standard? That is, is anyone free to create OpenType fonts, OpenType font tools, OpenType font renderers? Is the documentation freely available at no more than nominal cost?

> Unicode is a text encoding standard. Fonts and other software implement
> the standard. The 'openness' of the standard doesn't imply anything about
> the 'openness' of the software.

Indeed.

> Font developers are under no obligation to provide you with free fonts.
> Do you not charge for your work? If you want fonts to be freely
> available, you have to find some way to pay for their development,

I think it's more accurate to say that you need to find a way to compensate the font developer for his effort; this need not involve money. I, for example, create programs and give them to people for a reward I consider sufficient; professionally, I write bespoke software which is not useful to anyone but my employer; some 90% of all software written is in this class. (Are there bespoke fonts which the buyer keeps to himself?)

-- 
John Cowan  http://www.reutershealth.com  http://www.ccil.org/~cowan  <[EMAIL PROTECTED]>
Using RELAX NG compact syntax to develop schemas is one of the simple
pleasures in life  --Jeni Tennison
Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Doug Ewell scripsit:

> > Now suppose you have a UNIX filesystem, containing filenames in a
> > legacy encoding (possibly even more than one). If one wants to switch
> > to UTF-8 filenames, what is one supposed to do? Convert all filenames
> > to UTF-8?
>
> Well, yes. Doesn't the file system dictate what encoding it uses for
> file names? How would it interpret file names with "unknown" characters
> from a legacy encoding? How would they be handled in a directory
> search?

Windows filesystems do know what encoding they use. But a filename on a Unix(oid) file system is a mere sequence of octets, of which only 00 and 2F are interpreted. (Filenames containing 20, and especially 0A, are annoying to handle with standard tools, but not illegal.)

How these octet sequences are translated to characters, if at all, is no concern of the file system's. Some higher-level tools, such as directory listers and shells, have hardwired assumptions, others have changeable assumptions, but all are assumptions.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
No man is an island, entire of itself; every man is a piece of the
continent, a part of the main. If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friends or of thine own were: any man's death diminishes me,
because I am involved in mankind, and therefore never send to know for
whom the bell tolls; it tolls for thee.  --John Donne
Re: current version of unicode-font
Paul Hastings scripsit:

> speaking of which, *are* there any open source fonts that come even
> close to Arial Unicode MS?

In what, breadth of coverage or aesthetics?

The GNU Unifont has very wide coverage though it is a bitmap font; James Kass's CODE 2000 and CODE 2001 probably have the widest coverage of any font, though it costs US$5 to use them. Both of them IMHO are a tad on the ugly side.

Googling for "free Unicode fonts" (no quotes) is useful.

-- 
John Cowan  <[EMAIL PROTECTED]>  http://www.reutershealth.com  http://www.ccil.org/~cowan
One Word to write them all,
One Access to find them,
One Excel to count them all,
And thus to Windows bind them.  --Mike Champion
Re: Relationship between Unicode and 10646
Peter Kirk scripsit:

> >There are a number of people, yourself included, who are actively,
> >either maliciously or from ignorance, misrepresenting the relationship
> >between the UTC and WG2, and of the standardization process, under the
> >guise of "innocent" discussion. ...
>
> I have merely been asking searching questions, partly from ignorance I
> agree. If you or anyone else considers that I have been misrepresenting
> the relationship, you are free to correct me.

Your main misunderstanding seems to be your belief that WG2 is a democratic body; that is, that it makes decisions by majority vote. Decisions are made by explicitly reached consensus, and ballots are an instrument of reaching consensus. Every "no" vote must be accompanied by comments such that, if they were accepted, the "no" vote would be changed to "yes". ("Yes" votes can have comments too.) The result of a "no" vote is that the process loops until all such votes are resolved.

Although the UTC does not have a vote as such, being a liaison member, its input is treated as if it were vote comments.

If consensus cannot be reached, the proposal is eventually dropped, I suppose.

-- 
John Cowan  <[EMAIL PROTECTED]>  http://www.reutershealth.com  http://www.ccil.org/~cowan
Time alone is real
the rest imaginary
like a quaternion  --phma
Re: (base as a combing char)
Philippe Verdy scripsit:

> For this reason, Dutch will need a distinct "ij"
> letter, coded as a single character, and with its own capitalization rules
> (the uppercase or titlecase form of "ij" will be the single letter "IJ",
> not two letters and not "Ij"; also there exists cases where diacritics can
> be added on top of the "ij" letter, which is then more tied as a single
> letter than a simple digraph.)

Everything you say is correct *except* for the need to encode Dutch ij as a single character, which is neither necessary nor practical. (U+0132 and U+0133 are encoded for compatibility only.) In cases where ij is a digraph in Dutch text, i+ZWNJ+j will be effective.

-- 
John Cowan  [EMAIL PROTECTED]  http://www.reutershealth.com  http://www.ccil.org/~cowan
"Kill Gorgûn! Kill orc-folk! No other words please Wild Men.
Drive away bad air and darkness with bright iron!"  --Ghân-buri-Ghân
Re: Relationship between Unicode and 10646
Peter Kirk scripsit:

> I don't want to go along with Philippe entirely on this, but surely he
> must be right on this last point. Formally, Unicode is effectively the
> agent of just one national body in this decision-making process.

The Unicode Consortium is not an agent of the USNB, although it is a U.S. corporation. It is itself an international organization, even having some governmental bodies as members (agencies of the Indian and Pakistani national governments and the Tamil Nadu state government), one intergovernmental organization, one international non-governmental organization, and at least a dozen non-U.S. corporations.

> But formally these other bodies do have the right to
> outvote Unicode, and in effect to force Unicode to reverse its decisions
> - or else to reverse its policy of maintaining compatibility.

Formally, yes. However, by acts of self-abnegation, WG2 has a fixed policy of not overriding the UTC or vice versa.

> Here in Europe it does not go down well when US bodies claim the right
> to make decisions for the whole world,

It's a mistake to think of the Consortium as a U.S. body.

-- 
John Cowan  http://www.ccil.org/~cowan  http://www.reutershealth.com  [EMAIL PROTECTED]
Mark Twain on Cecil Rhodes:
"I admire him, I freely admit it, and when his time comes I shall
buy a piece of the rope for a keepsake."
Re: Misuse of 8th bit [Was: My Querry]
Antoine Leca scripsit:

> In a similar vein, I cannot be in agreement that it could be advisable to
> use the 22th, 23th, 32th, 63th, etc., the upper bits of the storage of a
> Unicode codepoint. Right now, nobody is seeing any use for them as part of
> characters, but history should have learned us we should prevent this kind
> of optimisations to occur.

No, I don't agree with this part. Unicode just isn't going to expand past 0x10FFFF unless Earth joins the Galactic Empire. So the upper bits are indeed free for private uses.

> Particularly when it is NOT defined by the
> standards: such a situation leads everybody and his dog to find his
> particular "optimum" use for these "free space", and these classes of
> optimums do not generally collides between them...

I don't think this matters as long as the upper bits are not used in interchange. For example, it would be reasonable to represent Unicode characters as immediates on a virtual machine by using some pattern in the upper bits that flags them as characters.

-- 
John Cowan  [EMAIL PROTECTED]  http://www.ccil.org/~cowan  http://www.reutershealth.com
Eric Raymond is the Margaret Mead of the Open Source movement.
        --Bruce Perens, some years ago
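The virtual-machine idea in this message can be sketched as a tagged 32-bit word. Everything specific below is hypothetical: the 3-bit tag width, the tag value `0b101`, and the function names are chosen only for illustration and do not describe any particular VM.

```python
WORD_MASK = 0xFFFFFFFF
TAG_MASK = 0b111 << 29       # hypothetical: top 3 bits of a 32-bit word hold a type tag
CHAR_TAG = 0b101 << 29       # hypothetical tag value meaning "this word is a character"
MAX_CODE_POINT = 0x10FFFF    # fits in 21 bits, leaving the upper bits unused

def box_char(ch: str) -> int:
    """Pack one character into a tagged 32-bit immediate."""
    return (CHAR_TAG | ord(ch)) & WORD_MASK

def is_char(word: int) -> bool:
    """Check the tag bits without touching the payload."""
    return (word & TAG_MASK) == CHAR_TAG

def unbox_char(word: int) -> str:
    """Strip the tag before the value leaves the VM; tagged words are never interchanged."""
    if not is_char(word):
        raise ValueError("not a character immediate")
    return chr(word & ~TAG_MASK & WORD_MASK)

word = box_char("\U0010ffff")
print(hex(word), is_char(word), unbox_char(word) == "\U0010ffff")
```

The tag exists only inside the process; anything written out is the bare code point, which is the condition the message places on private use of the upper bits.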
Re: My Querry
Antoine Leca scripsit:

> Sorry, no: there is no requirement to clear it.
> You are assuming something about the way data are handled. When you handle
> ASCII data using octets, you can perfectly, and conformantly, keep some
> other "data" (being parity or whatever) inside the 8th bit; so with even
> parity AT SIGN will be managed as 192, without any kind of problem (for
> you).

Indeed, the DEC PDP-8 stored ASCII data with the high bit always set, for compatibility with the way in which ASR-33 Teletypes generated ASCII. So I grew up thinking of 'A' as 301 (octal). I believe that PR1ME computers did this too.

-- 
John Cowan  <[EMAIL PROTECTED]>
But you, Wormtongue, you have done what you could for your true master. Some
reward you have earned at least. Yet Saruman is apt to overlook his bargains.
I should advise you to go quickly and remind him, lest he forget your faithful
service.  --Gandalf
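Both figures in this exchange (192 for AT SIGN under even parity, octal 301 for 'A' with the high bit always set) can be reproduced with a few lines of bit arithmetic. The function names are invented for this sketch.

```python
def with_even_parity(code: int) -> int:
    """Set the 8th bit so that the total number of 1 bits in the octet is even."""
    seven = code & 0x7F
    ones = bin(seven).count("1")
    return seven | (0x80 if ones % 2 else 0)

def with_high_bit_set(code: int) -> int:
    """Always set the 8th bit, as described for the ASR-33/PDP-8 convention."""
    return (code & 0x7F) | 0x80

def strip_high_bit(octet: int) -> int:
    """Recover the 7-bit ASCII code from either convention."""
    return octet & 0x7F

print(with_even_parity(ord("@")))        # 192
print(oct(with_high_bit_set(ord("A"))))  # 0o301
```

AT SIGN is 0x40, which has a single 1 bit, so even parity sets the 8th bit and yields 192; 'A' is 0x41 (octal 101), and forcing the high bit gives octal 301.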
Re: Unicode HTML, download
Peter Kirk scripsit:

> Please read my earlier posting. Of course it does make things rather
> difficult that none of my postings ever get approved on a Sunday,
> especially when I am trying to correct seriously misleading factual errors.

Yr hble Hebrew Moderator attempts to work 24/7, but occasionally the need to sleep or to engage in business (I was at a conference all last week) or family business (a death in a friend's family) interferes with this otherwise laudable goal.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
"In computer science, we stand on each other's feet."  --Brian K. Reid
Re: [even more increasingly OT-- into Sunday morning] Re: Unicode HTML, download
Michael (michka) Kaplan scripsit:

> > I haven't used M$ IE for many years, though, and my
> > memory might be wrong.
>
> Blinded by the misspelling of the product name, maybe? :-)

No, that's just a glyph difference. :-)

> See http://msdn.microsoft.com/msdnmag/issues/0700/localize/ and the section
> entitled "Choosing Character Sets" for info on what is going on here,
> particularly firgures 3 and 4 for info on how to script the behavior for the
> UTF-8 case

Nice article, though it's obnoxious that the figures will only open in a pop-up window.

-- 
John Cowan  <[EMAIL PROTECTED]>
Ambassador Trentino: I've said enough. I'm a man of few words.
Rufus T. Firefly: I'm a man of one word: scram!  --Duck Soup
Re: U+0000 in C strings
Philippe Verdy scripsit:

> The "modified UTF-8" encoding is only for use in the serialization of
> compiled classes that contain a constant string pool, and through the JNI
> interface to C-written modules using the legacy *UTF() APIs that want to
> work with C strings.

Plus the original point of contention: binary serialization of Strings through the DataInput and DataOutput interfaces.

-- 
John Cowan  www.ccil.org/~cowan  www.reutershealth.com  [EMAIL PROTECTED]
In might the Feanorians / that swore the unforgotten oath
brought war into Arvernien / with burning and with broken troth.
and Elwing from her fastness dim / then cast her in the waters wide,
but like a mew was swiftly borne, / uplifted o'er the roaring tide.
        --the Earendillinwe
Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)
Doug Ewell scripsit:

> Then why do the DataInput and DataOutput interfaces perform this special
> conversion? There isn't any mention, on the page whose URL Theodore
> originally provided, of compatibility with C strings.

Probably because Sun was reusing the format that string literals take in compiled Java classes. The format is as compact as UTF-8 provided your characters are in the range U+0001 to U+FFFF, which is true most of the time. Serializing with a 32-bit length would be much bulkier.

> If a Java String consists of a count followed by the data,

I didn't say that. A Java String in memory contains a count and the data, because it is basically a wrapper around a Java array of characters, and Java arrays contain a count. (Strings, unlike arrays, are immutable in Java.) That does not mean that the count is "followed by" the data in the memory representation, which indeed is up to the JVM -- Java does not prescribe it.

> Those are design benefits. I was asking about the ability to represent
> text adequately.

Strings are not used solely to represent text; they are more general.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Consider the matter of Analytic Philosophy. Dennett and Bennett are well-known.
Dennett rarely or never cites Bennett, so Bennett rarely or never cites Dennett.
There is also one Dummett. By their works shall ye know them. However, just as
no trinities have fourth persons (Zeppo Marx notwithstanding), Bummett is hardly
known by his works. Indeed, Bummett does not exist. It is part of the function
of this and other e-mail messages, therefore, to do what they can to create him.
Re: U+0000 in C strings (was: Re: Opinions on this Java URL?)
Doug Ewell scripsit: > As soon as you can think of one, let me know. I can think of plenty of > *binary* protocols that require zero bytes, but no *text* protocols. Most languages other than C define a string as a sequence of characters rather than a sequence of non-null characters. The repertoire of characters that can exist in strings usually has a lower bound, but its full magnitude is implementation-specific. In Java, exceptionally, the repertoire is defined by the standard rather than the implementation, and it includes U+0000. In any case, I can think of no language other than C which does not support strings containing U+0000 in most implementations. -- John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com "But no living man am I! You look upon a woman. Eowyn I am, Eomund's daughter. You stand between me and my lord and kin. Begone, if you be not deathless. For living or dark undead, I will smite you if you touch him."
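The contrast drawn above between C strings and everyone else's strings can be sketched in a few lines of Python. This is only an illustration, not anything from the thread: Python stands in for "most languages other than C", and the standard ctypes module stands in for C, since its `.value` accessor reads a buffer as a NUL-terminated string.

```python
import ctypes

text = "a\x00b"                  # U+0000 sits in the middle of the string
encoded = text.encode("utf-8")   # standard UTF-8 keeps it as a single 0x00 byte

# create_string_buffer copies every byte (and appends a terminator), but
# .value reads the buffer the way C would: it stops at the first NUL.
buf = ctypes.create_string_buffer(encoded)
as_c_string = buf.value

print(len(text))        # the Python string has three characters
print(buf.raw)          # all the bytes are still in the buffer
print(as_c_string)      # but the C-style view ends before the NUL
```

The data survives in memory; it is only the C convention for finding the end of the string that loses everything after U+0000.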
Opinions on this Java URL?
Theodore H. Smith scripsit: > I'm just curious about the \0 thing. What problems would having a \0 in > UTF-8 present, that are not presented by having \0 in ASCII? I can't > see any advantage there. AFAICT it was a hack so that arbitrary Java strings could be encoded as C strings; that is, with no 0x00 bytes in them, even when the string contained a U+0000. This is the format used in Java class files for string constants as well. The important thing is to note that the readUTF and writeUTF methods are *binary* I/O; they are the standard way of serializing strings, just as the standard way of serializing ints is to write them out as a 4-byte big-endian sequence. They simply have nothing to do with character encoding at all. -- He made the Legislature meet at one-horse John Cowan tank-towns out in the alfalfa belt, so that [EMAIL PROTECTED] hardly nobody could get there and most of http://www.reutershealth.com the leaders would stay home and let him go http://www.ccil.org/~cowan to work and do things as he pleased.--Mencken, Declaration of Independence
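For readers who want to see the byte layout being described, here is a rough Python sketch of the format that writeUTF produces: a 16-bit big-endian length followed by "modified UTF-8". The function name `modified_utf8` is made up for this illustration, and the code is a reading of the documented format rather than Sun's implementation.

```python
import struct

def modified_utf8(s: str) -> bytes:
    """Encode s roughly the way Java's DataOutput.writeUTF lays it out."""
    out = bytearray()
    # Java strings are sequences of UTF-16 code units, so work on those.
    utf16 = s.encode("utf-16-be", "surrogatepass")
    for (unit,) in struct.iter_unpack(">H", utf16):
        if 0x0001 <= unit <= 0x007F:
            out.append(unit)  # one byte, same as ASCII
        elif unit <= 0x07FF:
            # Two bytes; this branch also catches U+0000, giving C0 80.
            out += bytes((0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)))
        else:
            # Three bytes; each surrogate is encoded on its own.
            out += bytes((0xE0 | (unit >> 12),
                          0x80 | ((unit >> 6) & 0x3F),
                          0x80 | (unit & 0x3F)))
    if len(out) > 0xFFFF:
        raise ValueError("encoded form does not fit the 16-bit length prefix")
    # The data is preceded by an unsigned 16-bit big-endian byte count.
    return struct.pack(">H", len(out)) + bytes(out)

print(modified_utf8("A\x00").hex())
```

The payload never contains a 0x00 byte, which is what makes it usable as a C string, and for anything in the BMP other than U+0000 it is byte-for-byte the same as ordinary UTF-8. A supplementary character comes out as six bytes instead of four.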
Re: not font designers?
Elaine Keown scripsit: > >Just of curiosity, how many of you are NOT font > >designers? > > > >And are any of your corpus linguists, text database > >people, or maybe database designers? FWIW, I am none of those things (I've designed a database now and then, but I'm hardly a "database designer"). -- The Imperials are decadent, 300 pound John Cowan <[EMAIL PROTECTED]> free-range chickens (except they have http://www.reutershealth.com teeth, arms instead of wings, and http://www.ccil.org/~cowan dinosaurlike tails).--Elyse Grasso
Re: basic-hebrew RtL-space ?
Doug Ewell scripsit: > I've never understood why writing Hebrew or Arabic left-to-right is > called "visual" order anyway. These are RTL scripts; they are supposed > to be not only written, but also read, right-to-left. Wouldn't a reader > of Hebrew or Arabic consider RTL to BE the "visual" order? Of course. It's sheer ethnocentrism. -- Values of beeta will give rise to dom! John Cowan (5th/6th edition 'mv' said this if you tried http://www.ccil.org/~cowan to rename '.' or '..' entries; see [EMAIL PROTECTED] http://cm.bell-labs.com/cm/cs/who/dmr/odd.html)
Re: Public Review Issue: UAX #24 Proposed Update
Peter Kirk scripsit: > >Names are sometimes inaccurate, viz. ZINOR and ZARQA and the infamous > >FHTORA. That doesn't change the meaning or utility of the character. > > Agreed. It simply changes, indeed destroys completely, the utility of > the character name. Not at all. As I've told you before (and you agreed before), it's just as much a fallacy to suppose that Unicode character names carry no information as to suppose that they carry complete information. The truth is somewhere between: most names are helpful, a few names are partially misleading (but not totally so). As for FHTORA, it's annoying, but I don't see how it can be read as anything but FTHORA if you know anything about Greek at all, which is probably why it was overlooked until it was too late. -- You escaped them by the will-death John Cowan and the Way of the Black Wheel. [EMAIL PROTECTED] I could not. --Great-Souled Samhttp://www.ccil.org/~cowan
Re: Public Review Issue: UAX #24 Proposed Update
Andrew C. West scripsit: > "In principle when a character of a given script is used in more than one > language, no language name is specified. Exceptions are tolerated where an > ambiguity would otherwise result." [N2652R Annex L Rule 9] Indeed, but this begs the question of whether the characters in question are indeed unique to Yiddish or not. My other point stands. -- Babies are born as a result of the John Cowan mating between men and women, and most http://www.reutershealth.com men and women enjoy mating. http://www.ccil.org/~cowan --Isaac Asimov in Earth: Our Crowded Spaceship [EMAIL PROTECTED]
Re: Public Review Issue: UAX #24 Proposed Update
Jony Rosenne scripsit: > The UTC refused to add Yiddish to the name, unlike the other Yiddish > specialties, and I am not aware of any other possibility. Why should it? Incorporating a language name into a character name, as in ABKHASIAN CHE and KHAKASSIAN CHE, is done because those languages have a letter named CHE distinct from the more usual, cross-linguistic Cyrillic CHE. There is no such contrast in this case: we do not speak of LATIN SMALL LETTER ICELANDIC THORN, for example. -- Some people open all the Windows; John Cowan wise wives welcome the spring [EMAIL PROTECTED] by moving the Unix. http://www.reutershealth.com --ad for Unix Book Units (U.K.) http://www.ccil.org/~cowan (see http://cm.bell-labs.com/cm/cs/who/dmr/unix3image.gif)
Re: Public Review Issue: UAX #24 Proposed Update
Jony Rosenne scripsit: > FB1D, HEBREW LETTER YOD WITH HIRIQ, should be assigned to the unknown group. > It is not a Hebrew character, notwithstanding the misleading name. To anticipate Michael: Of course it is. It's not used in the Hebrew language, perhaps; but the Hebrew script is used for other languages besides Hebrew. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com "Mr. Lane, if you ever wish anything that I can do, all you will have to do will be to send me a telegram asking and it will be done." "Mr. Hearst, if you ever get a telegram from me asking you to do anything, you can put the telegram down as a forgery."
Japanese pitch accent representations
The following links show L-shaped marks, apparently combining characters, that indicate the change-of-pitch position in Japanese words written in romaji. Are these novel characters, or can they be identified with existing Unicode characters? Are they really combining? http://member.newsguy.com/~sakusha/dict/martin-je.html http://member.newsguy.com/~sakusha/dict/kenkyusha-je.html -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan Consider the matter of Analytic Philosophy. Dennett and Bennett are well-known. Dennett rarely or never cites Bennett, so Bennett rarely or never cites Dennett. There is also one Dummett. By their works shall ye know them. However, just as no trinities have fourth persons (Zeppo Marx notwithstanding), Bummett is hardly known by his works. Indeed, Bummett does not exist. It is part of the function of this and other e-mail messages, therefore, to do what they can to create him.
Re: MSDN Article, Second Draft
Jungshik Shin scripsit: > As is often the case, Unicode experts are not necessarily experts on > 'legacy' character sets and encodings. The 'official' name of 'ASCII' is > ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode, > I'm afraid you're spreading misinformation about what came before it. > The sentence that 'ANSI pushed this scope ... represents 256 characters' > is misleading. ANSI has nothing to do with various single, double, > triple byte character sets that make up single and multibyte character > encodings. They're devised and published by national and international > standard organizations as well as various vendors. Perhaps, you'd better > just get rid of the sentence 'ANSI pushed ... providing backward > compatibility with ASCII'. Like it or not, "ANSI" has two meanings now: the American National Standards Institute and a generic term for an 8-bit Windows codepage. Similarly, "OEM" means both an original equipment manufacturer and an 8-bit PC-DOS codepage. -- "No, John. I want formats that are actually John Cowan useful, rather than over-featured megaliths that http://www.ccil.org/~cowan address all questions by piling on ridiculous http://www.reutershealth.com internal links in forms which are hideously[EMAIL PROTECTED] over-complex." --Simon St. Laurent on xml-dev
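A small Python sketch may make the "ANSI" versus "OEM" distinction concrete. The choice of cp1252 and cp437 as representatives is an assumption made for this example; the message above does not name specific code pages.

```python
byte = b"\xe9"

ansi = byte.decode("cp1252")   # a Windows "ANSI" code page (Western European)
oem = byte.decode("cp437")     # the original IBM PC "OEM" code page

print(f"U+{ord(ansi):04X}")    # code point under the ANSI reading
print(f"U+{ord(oem):04X}")     # code point under the OEM reading

# Going the other way, the same character lands on different byte values.
e_acute = "\u00e9"
print(e_acute.encode("cp1252"), e_acute.encode("cp437"))
```

The same 8-bit value is a Latin letter under one code page and a Greek one under the other, which is why the two terms have to be kept apart even though neither has much to do with the organizations they are named after.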
Re: Combining across markup?
Anto'nio Martins-Tuva'lkin scripsit: > Even better yet: Have the WC3 rephrase their demand that no element > should start with a defective sequence (when considered in separate) > as that no *block-level* element should etc., and leave things like > , and other in-line elements free to start with a combining > character (provided that the said in-line container is not the first > within a block-level element, of course). The trouble with that idea is that in XML generally we don't know what is a block-level element: elements are just elements, and it's up to rendering routines whether they appear as block, inline, or not at all. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan Promises become binding when there is a meeting of the minds and consideration is exchanged. So it was at King's Bench in common law England; so it was under the common law in the American colonies; so it was through more than two centuries of jurisprudence in this country; and so it is today. --Specht v. Netscape
Re: Much better Latin-1 keyboard for Windows
Michael Everson scripsit: > >Interesting. There seems to be no explanation of the seven keyboard > >states shown in the graphic at ga-keys-x.gif. Can you explicate them? > > Hm? The shift, alt, and caps lock keys are shown depressed in the drawings. Ah, that strange glyph is Alt, or rather AltGr, then. I presume the Swedish-church-symbol is functioning as the variety of Alt that makes keyboard accelerators. -- What is the sound of Perl? Is it not the John Cowan sound of a [Ww]all that people have stopped [EMAIL PROTECTED] banging their head against? --Larryhttp://www.ccil.org/~cowan
Re: Much better Latin-1 keyboard for Windows
Michael Everson scripsit: > Please see the specification of the Irish > Extended keyboard for Unicode, at > http://www.evertype.com/celtscript/ga-keys-x.html Interesting. There seems to be no explanation of the seven keyboard states shown in the graphic at ga-keys-x.gif. Can you explicate them? -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan Female celebrity stalker, on a hot morning in Cairo: "Imagine, Colonel Lawrence, ninety-two already!" El Auruns's reply: "Many happy returns of the day!"
Re: http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt
Asmus Freytag scripsit: > >Is John Cowan's list supposed to be a complete list of > >foldables for extant Hebrew code points? > > We know its not. It lists all the characters which have points embedded in them. If you map all those characters away and delete all explicit points and accents, you have unpointed Hebrew. -- Verbogeny is one of the pleasurettesJohn Cowan <[EMAIL PROTECTED]> of a creatific thinkerizer. http://www.reutershealth.com -- Peter da Silvahttp://www.ccil.org/~cowan
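The two steps described above (map away the characters with embedded points, then delete the explicit points and accents) can be approximated in Python with the standard unicodedata module. This is a sketch under one assumption not stated in the message: that canonical decomposition (NFD) is an acceptable way to do the "mapping away", and that every point and accent is a nonspacing mark (category Mn). The function name `unpoint` is invented for the example.

```python
import unicodedata

def unpoint(text: str) -> str:
    """Return text with Hebrew points and accents removed."""
    # NFD splits presentation forms such as U+FB1D into letter + point.
    decomposed = unicodedata.normalize("NFD", text)
    # Points and cantillation accents are nonspacing marks (Mn); drop them.
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")

pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"  # shalom, fully pointed
print(unpoint(pointed))
print(unpoint("\ufb1d"))  # HEBREW LETTER YOD WITH HIRIQ
```

Punctuation such as maqaf is not category Mn and so passes through untouched. Applied to text in other scripts, the same function would also strip Latin or Greek accents, so a real folding would probably want to restrict it to the Hebrew range.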
Re: Folding algorithm and canonical equivalence
Asmus Freytag scripsit: > There are two options for a starting set: > select all 'accents' (note, not baseforms) that occur in some > precomposed character. And then add additional ones on a case by case > basis (e.g. stroke overlay). > > Or, start with all gc=Mn from the 0300 and 1DC0 blocks (the latter will > be part of 4.1), and make some principled additions / deletions. I'd say, start from the combining characters which are in ISO 10646 Level 3, and then add the combining characters from the abjads (Hebrew, Arabic, Syriac). -- XQuery Blueberry DOM John Cowan Entity parser dot-com [EMAIL PROTECTED] Abstract schemata http://www.reutershealth.com XPointer errata http://www.ccil.org/~cowan Infoset Unicode BOM --Richard Tobin
Re: Folding algorithm and canonical equivalence
Peter Kirk scripsit: > Anyway, is Yiddish in fact never written completely unpointed? That > would surprise me. It might have happened at some point, but the standard (YIVO) Yiddish orthography would become illegible if points were stripped. -- Principles. You can't say A is John Cowan <[EMAIL PROTECTED]> made of B or vice versa. All mass http://www.reutershealth.com is interaction. --Richard Feynman http://www.ccil.org/~cowan
Re: Much better Latin-1 keyboard for Windows
Raymond Mercier scripsit: > Jowh Cowan writes Jowh? > Latin-1 is not everything! If you need to transcribe > Arabic/Hebrew/Sanskrit/Farsi, you will need the macrons on vowels (Latin > Extended-A) and various dot-under letters (Latin Extended Additional). I > made my own layout using the DDK. No, it isn't everything, but it's a great deal, especially considering the annoying behavior of the standard US-International keyboard. Why not release your keyboard to the world? -- "You're a brave man! Go and break through the John Cowan lines, and remember while you're out there [EMAIL PROTECTED] risking life and limb through shot and shell, www.ccil.org/~cowan we'll be in here thinking what a sucker you are!" www.reutershealth.com --Rufus T. Firefly
Much better Latin-1 keyboard for Windows
http://www.livejournal.com/users/gwalla/39856.html is a page about (and a link to) a truly excellent Windows keyboard driver that provides full access to the Latin-1 range but is completely compatible with the US-ASCII keyboard except for AltGr (the right Alt key). All non-ASCII characters and dead keys are available there: for example, to get à, one types AltGr-` followed by a. I can't recommend this too much; I immediately dropped both the US-ASCII and US-International keyboards, which I have been using in alternation. The only (very minor) problem with it is that for some reason it messes up Ctrl-Shift and Ctrl-nonletter key combinations. -- "Well, I'm back." --SamJohn Cowan <[EMAIL PROTECTED]>
Re: Folding algorithm and canonical equivalence
Asmus Freytag scripsit: > John, you proposed the initial set. Do you have any suggestion here? My original submission had only the single-character mappings, not the character pair mappings, which are just the result of decomposing the precomposed set and don't IMHO make much sense: they are too selective. The list predates TR#30; I developed it for the purpose of making NFC Latin text minimally legible on an old ASCII-only printer. (I simply changed the filtering regex from "LATIN" to "LATIN|GREEK|CYRILLIC|HEBREW".) It was not intended to cope with partially or fully decomposed text. I agree that in the TR#30 context, the Right Thing is to remove the character pair mappings altogether, and all of the single-character mappings that have canonical decompositions. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan I come from under the hill, and under the hills and over the hills my paths led. And through the air. I am he that walks unseen. I am the clue-finder, the web-cutter, the stinging fly. I was chosen for the lucky number. --Bilbo
Re: Folding algorithm and canonical equivalence
Peter Kirk scripsit: > But I think the best thing to do is to drop *all* Hebrew > combining marks; the result of this is valid unpointed Hebrew. I agree. -- Schlingt dreifach einen Kreis vom dies! John Cowan <[EMAIL PROTECTED]> Schliesst euer Aug vor heiliger Schau, http://www.reutershealth.com Denn er genoss vom Honig-Tau, http://www.ccil.org/~cowan Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)
Re: Umlaut and Tréma, was: Variation selectors and vowel marks
Doug Ewell scripsit: > CGJ + COMBINING DIAERESIS is a hack, but then again the need to draw a > distinction between the exact same combining mark used for two different > phonetic purposes is a bit of a hack too. However, there used to be typographical distinctions in certain German fonts between umlaut and diaeresis: see the examples on p. 15 of Victor Gaultney's paper "Problems of diacritic design for Latin script text faces" at http://www.sil.org/~gaultney/ProbsOfDiacDesignLowRes.pdf (warning: 1.4M), particularly Figure 39. > The alternative proposed by DIN, creating a new COMBINING UMLAUT > character, would have caused *unprecedented and catastrophic* > equivalence and normalization problems. Indeed. -- "Take two turkeys, one goose, four John Cowan cabbages, but no duck, and mix them http://www.ccil.org/~cowan together. After one taste, you'll duck [EMAIL PROTECTED] soup the rest of your life."http://www.reutershealth.com --Groucho
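Whatever one thinks of the CGJ hack, it does at least survive normalization, which a hypothetical COMBINING UMLAUT with its own equivalences would have complicated. A short Python check using the standard unicodedata module shows the behaviour; the variable names are just for this illustration.

```python
import unicodedata

CGJ = "\u034f"         # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"   # COMBINING DIAERESIS

# Without CGJ, NFC composes the pair into precomposed a-diaeresis.
plain = unicodedata.normalize("NFC", "a" + DIAERESIS)

# With CGJ in between, the base letter and the mark are no longer
# adjacent for composition purposes, so the sequence is left alone.
marked = unicodedata.normalize("NFC", "a" + CGJ + DIAERESIS)

print([f"U+{ord(c):04X}" for c in plain])
print([f"U+{ord(c):04X}" for c in marked])
```

So a text that marks one of the two uses with CGJ keeps that distinction through NFC, while the unmarked use normalizes to the ordinary precomposed letter.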
Re: Looking for transcription or transliteration standards latin- >arabic
Peter Kirk scripsit: > I have just reviewed this list and found it odd that Hebrew presentation > forms are included but Arabic ones are not. The specification actually called only for Latin, Greek, and Cyrillic; I added Hebrew pour la lagniappe. If someone wants to add Arabic, I encourage them to do so. > the Hebrew presentation forms but also most of the precomposed > characters are redundant in this list. True; however, the current list indicates the scope of what actually happens, even if it is overlong. > It is therefore > necessary to list in the specification of the folding only all (?) > combining marks, which are to be deleted, I believe that all Mn-class characters, and only they, are deleted by this. > I note that 0429 is not folded to 0428 etc, and this is correct because > within the Cyrillic writing system these are entirely separate > characters. But the difference between these two is in fact exactly the > same descender which is removed in 0496 etc. I don't think that matters. Long historical practice has made SHCHA a separate letter, just as G, J, U, and W are now separate Latin letters from C, I, V, and VV-ligature. > I am also surprised to note > that no folding is given for 0419/0439; although in some ways this is > desirable because Russians do not consider this breve to be a diacritic > (and after all we would not want the dot on i to be removed as a > diacritic!), these characters have canonical decompositions to 0418/0438 > and breve and the principle of canonical equivalence and the folding > algorithm (which works on decomposed characters) more or less demand > that the breve be deleted. Also 048A/048B should then fold to 0418/0438 > rather than 0419/0439. I think I agree with this: i-breve does not have the same universal status as shch. 
-- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] 'Tis the Linux rebellion / Let coders take their place, The Linux-nationale / Shall Microsoft outpace, We can write better programs / Our CPUs won't stall, So raise the penguin banner of / The Linux-nationale.
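The distinction between 0419 (which has a canonical decomposition) and 0429 or 048A (which do not) can be checked directly against the Unicode character database. The following Python sketch uses the standard unicodedata module; the helper name `canonical_decomposition` is invented here.

```python
import unicodedata

def canonical_decomposition(ch: str) -> str:
    """Return the canonical decomposition mapping of ch, or '' if none."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>"; skip those.
    return "" if mapping.startswith("<") else mapping

for cp in (0x0419, 0x0429, 0x048A, 0x0496):
    ch = chr(cp)
    print(f"{cp:04X} {unicodedata.name(ch)}: "
          f"{canonical_decomposition(ch) or 'no decomposition'}")
```

Only 0419 comes back with a mapping (to 0418 plus the combining breve), so a folding that runs on decomposed text will strip its breve whether or not the table lists it. The descender letters have no decomposition at all, which is why they need explicit table entries if they are to be folded.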
Re: Looking for transcription or transliteration standards latin- >arabic
Jony Rosenne scripsit: > I doubt it makes much sense to the casual reader. Witness how nearly every > radio and television pronounces New Delhi as New Del-hi. O pity the poor poor Zippity, For he can eat nothing but Greli, A plant that grows only In New Caledony, While the Zippity lives in New Delhi. --Shel Silverstein -- "Take two turkeys, one goose, four John Cowan cabbages, but no duck, and mix them http://www.ccil.org/~cowan together. After one taste, you'll duck [EMAIL PROTECTED] soup the rest of your life."http://www.reutershealth.com --Groucho
Re: Looking for transcription or transliteration standards latin- >arabic
Patrick Andries scripsit: > >So the change is more like Beijing -> Peking than Berlin -> Kitchener. > > Without a political change Constantinople would not have changed name in > a matter of days (at least as far as the officials were concerned). In > any case, it is not a transliteration problem (Beijing --> Pékin). Not just a transliteration problem, either: Mandarin Chinese underwent a sound-shift in the 17th century that changed the second syllable from "ging" to "jing", but the English name was already set (and the change did not affect Southern Sinitic in any case; cf. Cantonese "pak king"). In addition, when it isn't the capital (bei jing = "North-capital"), i.e. 1928-49, its name is Beiping ("north-peace"). -- Here lies the Christian,John Cowan judge, and poet Peter, http://www.reutershealth.com Who broke the laws of God http://www.ccil.org/~cowan and man and metre. [EMAIL PROTECTED]
Re: Looking for transcription or transliteration standards latin- >arabic
Peter Kirk scripsit: > Well, did Gdansk/Danzig change its name backwards and forwards several > times over history (thank you, Qrczak, for the interesting information > about that), or was it simply that it had different names in different > languages? Yes to both. Its name in Polish is Gdan'sk, in German Danzig. Which one is the dominant name is determined by which power is dominant at a given time. What foreigners call the city is influenced, though not determined, by when the city first became important to them. There is hardly a city in Europe that isn't like this. What makes this one special, though hardly unique, is the repeated changes of sovereignty. Consider Strassburg/Strasbourg. > This makes it not a transliteration problem but a translation > problem, one which is common to many geographical names - sometimes the > names in different languages are related, and sometimes they are not > e.g. Turku/Åbo in Finland, or Yerushalayim/al-Quds, or Dublin/(I'll let > Michael tell us the correct Irish form). Baile Atha Cliath. Dublin is also an Irish name, though used mostly by Norse and English (and now by anglophone Irish, of course). -- My confusion is rapidly waxing John Cowan For XML Schema's too taxing:[EMAIL PROTECTED] I'd use DTDshttp://www.reutershealth.com If they had local trees -- http://www.ccil.org/~cowan I think I best switch to RELAX NG.
Re: is "n with tilde" used in French language ?
Stefan Persson scripsit: > I have only seen ñ in old French; however, old French also uses tilde > above lots of other characters, such as all vowels (ã?õ?) and a > lot of consonants, e.g. q?? (for the old spelling of "que"). Instead of > writing an "n", you often put a tilde over the letter preceding the "n". > So e.g. "France" was "Frãce." I believe that this spelling was used > until about the time of the French revolution. In origin the tilde *was* a degenerate "n", of course. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://www.reutershealth.com Thor Heyerdahl recounts his attempt to prove Rudyard Kipling's theory that the mongoose first came to India on a raft from Polynesia. --blurb for Rikki-Kon-Tiki-Tavi
Re: Looking for transcription or transliteration standards latin- >arabic
Philipp Reichmuth scripsit: > "Chykoffskee" is pretty accurate, actually. Thank you. I have long since forgotten all the (very small amount of) Russian I ever learned, but I retain a firm grip on its phonology due to an interesting paedagogical device. My Russian instructor spent the first week or so of class teaching us to speak English with a Russian accent (and this I can do to this day). The idea was that having mastered this, we could then begin to speak Russian as well with a Russian accent, which is to say, perfectly. > I'd say Tchaikovsky is just > a spelling taken over from French at a time when French was pretty much > the international common language at least in diplomacy and art. Doubtless. I have even seen it spelled in German fashion in English a time or two. -- I suggest you call for help, John Cowan or learn the difficult art of mud-breathing.[EMAIL PROTECTED] --Great-Souled Sam http://www.ccil.org/~cowan
Re: Looking for transcription or transliteration standards latin- >arabic
Doug Ewell scripsit: > On the contrary, untransliterated (or untranscribed) text can only be > read by people who know the original script. Transliterations and > transcriptions at least give the Latin-script-only reader a fighting > chance to pronounce the text. Transliterations don't work so well for that, but transliterating some scripts to Latin is a necessity (for me, at least) to even recognize them. I can cope with Greek, Hebrew, and Cyrillic, but an English text full of Arabic or Chinese names presented in the usual scripts for those languages would be hopeless -- I wouldn't be able to reliably tell one name from another. This is true even though I have no more Greek, Hebrew, or Russian than I have Arabic or Chinese. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com "If he has seen farther than others, it is because he is standing on a stack of dwarves." --Mike Champion, describing Tim Berners-Lee (adapted)
Re: Looking for transcription or transliteration standards latin- >arabic
Jony Rosenne scripsit: > Transcription does not require roundtrip. It is intended in this case for > the English speaker to be able to deliver an approximate pronunciation > adapted to his native vocal capabilities. Except when it doesn't. We write Tchaikovsky, not Chykoffskee. -- "I could dance with you till the cows John Cowan come home. On second thought, I'd http://www.ccil.org/~cowan rather dance with the cows when you http://www.reutershealth.com came home." --Rufus T. Firefly [EMAIL PROTECTED]
Re: Greek tonos and oxia
Peter Kirk scripsit: > Since the characters are in fact exactly equivalent, you can use > whichever you wish, as long as you are aware that some processes may > change one to the other. They should be rendered identically. True. But the original question was "Which are preferred", and there is a definite answer to that. > But, in favour of using the versions from the Extended Greek sets, > there are a number of fonts around which render the versions in the > main Greek and Coptic block (or has it been officially renamed just > "Greek"?) with a vertical tonos, Quite so. In general, though, we should encode text correctly and then use correct fonts, rather than adjusting our encoding to the vagaries of erroneous or obsolete fonts. Unicode 2.0 fonts also have the problem that they produce the wrong forms for theta and phi in running text. -- "In my last lifetime, John Cowan I believed in reincarnation;http://www.ccil.org/~cowan in this lifetime, [EMAIL PROTECTED] I don't." --Thiagi http://www.reutershealth.com
Re: Greek tonos and oxia
Michael Everson scripsit: > At 14:11 -0400 2004-06-30, John Cowan wrote: > > >But the X WITH ACUTE characters there are exactly equivalent to the > >X WITH TONOS characters in the main Greek block, and the ones in the > >main Greek block are in fact preferred. > > How can you tell they are preferred, John? Because normalization changes the former to the latter, as a result of the one-way nature of (singleton) canonical equivalence. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan Does anybody want any flotsam? / I've gotsam. Does anybody want any jetsam? / I can getsam. --Ogden Nash, No Doctors Today, Thank You
Re: Greek tonos and oxia
Peter Kirk scripsit: > If you prefer to use precomposed characters (rather than separate > diacritics as Ken suggested) or need to do so to meet W3C > recommendations, you should use the ones in the Extended Greek section, > which allow for a distinction between acute and grave accents which is > important for Classical Greek. Many of the characters in the Extended Greek block are indeed essential to polytonic Greek. But the X WITH ACUTE characters there are exactly equivalent to the X WITH TONOS characters in the main Greek block, and the ones in the main Greek block are in fact preferred. This can be determined by looking at the normalization rules, which will change all X WITH ACUTE characters to the corresponding X WITH TONOS characters. > You may like to look at Nick Nicholas' Greek Unicode site at > http://ptolemy.tlg.uci.edu/~opoudjis/unicode/unicode.html, which > discusses these issues. Indeed. -- "In my last lifetime, John Cowan I believed in reincarnation;http://www.ccil.org/~cowan in this lifetime, [EMAIL PROTECTED] I don't." --Thiagi http://www.reutershealth.com
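The claim about the normalization rules is easy to check. Here is a minimal Python sketch using the standard unicodedata module, with alpha chosen as a representative of "X"; the constant names are just labels for this example.

```python
import unicodedata

ALPHA_OXIA = "\u1f71"    # GREEK SMALL LETTER ALPHA WITH OXIA (Greek Extended)
ALPHA_TONOS = "\u03ac"   # GREEK SMALL LETTER ALPHA WITH TONOS (main Greek block)

# The Greek Extended character has a singleton decomposition to the
# main-block one, so normalization can only ever go in that direction.
print(unicodedata.decomposition(ALPHA_OXIA))

nfc_of_oxia = unicodedata.normalize("NFC", ALPHA_OXIA)
nfc_of_tonos = unicodedata.normalize("NFC", ALPHA_TONOS)
nfd_of_oxia = unicodedata.normalize("NFD", ALPHA_OXIA)

print(f"U+{ord(nfc_of_oxia):04X}", f"U+{ord(nfc_of_tonos):04X}")
```

Both inputs come out of NFC as the main-block character, and under NFD both become alpha followed by the combining acute. No normalization form ever produces the Greek Extended code point, which is the sense in which the main-block characters are preferred.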
Re: decent unicode capable web app editor
Edward H. Trager scripsit: > What about vim (vi clone: http://www.vim.org). I just converted > a very large UTF-8-encoded HTML document (more than 15000 > lines) to standards-compliant XHTML-1.0 and found the advanced > regular-expression-based substitution facilities in vi(m) absolutely > indispensible for adding all of the closing tags that XML requires > which were missing in my original document. HTML Tidy or TagSoup would probably have served you better, rather than groveling over the code bit by bit. (HTML Tidy can do more cleaning, but it sometimes loops or delivers garbage if the HTML is sufficiently broken. TagSoup never gives up and never loops, but doesn't know as much about HTML.) -- Said Agatha Christie / To E. Philips Oppenheim John Cowan "Who is this Hemingway? / Who is this Proust? [EMAIL PROTECTED] Who is this Vladimir / Whatchamacallum, http://www.reutershealth.com This neopostrealist / Rabble?" she groused. http://www.ccil.org/cowan --author unknown to me; any suggestions?
Re: Bantu click letters
Michael Everson scripsit: > Unless one contacted whomever it is who owns "Bantu Studies" and > simply *asked*. Carfax (part of the Taylor and Francis Group). Here's contact information: Reprints, permissions + electronic rights Joanne Nerland Taylor & Francis PO Box 2562 Solli N-0202 Oslo Norway + 47 22 12 9880 or: +47 22 12 9884 Mobile: +47 90 11 3974 +47 22 12 9890 But Gutenberg may not care: they mostly (now exclusively?) publish texts in the public domain. -- John Cowanhttp://www.ccil.org/~cowan [EMAIL PROTECTED] Please leave your valuesCheck your assumptions. In fact, at the front desk. check your assumptions at the door. --sign in Paris hotel --Cordelia Vorkosigan
Re: Bantu click letters
Michael Everson scripsit: > > > Effort and expense was made to cut the letters for the publication. > > > >And today, if I were reprinting it, I'd commission a digital font > >(your effort, my expense) and put the characters in the PUA. > > Not if you wanted, as an Africanist, to be able to represent the text > as it was originally written. We must be talking past one another somehow, but I don't understand how. To represent the text as originally written, I need a digital representation for each of the characters in it. Since all I want to do is reprint the book -- I don't need to use the unusual characters in interchange -- the PUA and a commissioned font seem just perfect to me. > You don't know whether or not they were only used in a single > document. You know only that I *own* that single document. You are > declaring the characters guilty until proved innocent. That's > antagonistic. I intend no antagonism. We treat the Phaistos-disk characters as guilty until proven innocent, for the same reason -- there's only one text. (It's also true that we can't interpret them, which is additional evidence against them.) There's no *point* in encoding the PD characters because they aren't used in interchange -- see above. > >If I decided to start using thorn instead of theta in my otherwise > >IPA transcriptions, that would be an idiosyncratic use of it. > > Plenty of Germanist transcriptions use thorn. In any case, the > analogy isn't relevant, as both thorn and theta are encoded and > available for use. I was talking about what it means to be idiosyncratic. (Not that either of us need any real instruction on the subject!) > >(LATIN LETTER OWL, indeed.) > > COMBINING SEAGULL BELOW, indeed. LATIN LETTER OI, indeed. :-) > [OWL] is interesting, by the way. Asmus says it's similar to > something the Japanese use for telephone answering machines. I don't > know about that, though it looks familar to me. I wonder what Doke's > source for it was. 
It looks to me the sort of thing that would be easy to reinvent. Some of my habitual doodles are much like it. > I was astonished because I hadn't seen them before. That does not > mean I didn't believe that they weren't worthy of encoding. Just > because I hadn't seen them before doesn't mean they don't exist and > aren't worthy of encoding either. Khoisian phonology is rather > esoteric, after all. Sure. I was addressing the question of the *novelty* of the characters. If neither you nor I nor anyone else in this community has seen them before, they are most certainly novel. > I am gobsmacked. On what grounds are these not characters? They are > not glyph representations of other characters. They *are* characters. It's just not useful to encode them, any more than it's useful to encode most of the scripts in the Conscript Registry. Find more documents, and the picture changes. (Find more Phaistos-type disks, and that picture changes too.) -- If you have ever wondered if you are in hell, John Cowan it has been said, then you are on a well-traveled http://www.ccil.org/~cowan road of spiritual inquiry. If you are absolutely http://www.reutershealth.com sure you are in hell, however, then you must be [EMAIL PROTECTED] on the Cross Bronx Expressway. --Alan Feuer, NYTimes, 2002-09-20
Re: Bantu click letters
Michael Everson scripsit: > Although Pullum and Ladusaw > don't show the glyphs, they refer specifically to Doke's characters > (s.v. ///). They describe them as "ad hoc" which I suppose they were, > in 1925, though "novel" would do as well as they aren't entirely > arbitrary and they weren't "found" bits of lead type pressed into > other service -- they were cut to order. If Sequoyah had had clout, we'd probably be using his original characters for Cherokee today. > That Pullum and Ladusaw have not forgotten Doke's characters suggests > that Africanists will also likely not forget them, and will find use > in access to them as encoded characters in the UCS. It's P&L's business to remember what would otherwise be (mercifully, in some cases) forgotten, so that people who need to interpret old documents have some hope of doing so. What we need is more evidence: either documentary evidence, or the evidence of breathing Africanists. -- John Cowan <[EMAIL PROTECTED]> http://www.ccil.org/~cowan http://www.reutershealth.com Charles li reis, nostre emperesdre magnes, Set anz totz pleinz ad ested in Espagnes.
Re: Bantu click letters
Michael Everson scripsit: > They were published in Bantu Studies in 1925 in an article by a > rather important scholar in the field of African linguistics. We don't encode characters according to the clout of the user, or the Apple logo would have been in Unicode long since. :-) > Effort and expense was made to cut the letters for the publication. And today, if I were reprinting it, I'd commission a digital font (your effort, my expense) and put the characters in the PUA. > The sounds they represent are idiosyncratic and difficult to > describe, much less write. I think that characters used in a single document by a single scholar, however prestigious, can fairly be described as idiosyncratic to him. If I decided to start using thorn instead of theta in my otherwise IPA transcriptions, that would be an idiosyncratic use of it. If instead I used OVERCLOCKED HOOCHIMADINGER SYMBOL, that would be even more idiosyncratic. (LATIN LETTER OWL, indeed.) > Personal? No: he published. Fair enough. > Novel? Perhaps > (in 1925); Doke is likely to have devised them. They are just as novel today as they were eighty years ago; I well remember how astonished you and I were, looking over the text. > Private use? Be > serious, John. That's a pretty ridiculous suggestion. I am serious. The PUA is the proper place for these things. -- "May the hair on your toes never fall out!" John Cowan --Thorin Oakenshield (to Bilbo) [EMAIL PROTECTED]
Re: Bantu click letters
Michael Everson scripsit: > Proposal to add Bantu phonetic click characters to the UCS > http://www.evertype.com/standards/iso10646/pdf/n2790-clicks.pdf [T]he Unicode Standard does not encode idiosyncratic, personal, novel, or private use characters [...]. Whatever may have been done in the past, I don't think that one document is enough to support the introduction of new Latin letters; these look extremely idiosyncratic, personal, novel and private use to me. -- All Norstrilians knew what laughter was: it was "pleasurable corrigible malfunction". --Cordwainer Smith, Norstrilia John Cowan http://www.reutershealth.com [EMAIL PROTECTED]
Re: Revised Phoenician proposal
Peter Constable scripsit: > > > In that sense, treating Phoenician as a script variant of Hebrew > > > is a big win for many of the users of the script, since they > > > would have a hard time deciphering the bizarre (to them) script > > > variant but have no problem reading texts originally written in > > > it in different fonts. > > I didn't understand that statement the first time round, and still am > not sure I understand it. (The antecedent for the last occurrence of > "it" isn't clear to me, so I'm having difficulty interpreting the whole > thing, apart from the matter of whether the point makes sense.) I interpret it to mean that if you know Hebrew, you can read text in Old Hebrew or Phoenician or whatever, provided you can get past the script barrier. For such people, there is some advantage in encoding these old texts with Hebrew characters, since a simple font change will convert between the authentic and the intelligible. By the same token, there would be some advantage to Croats wishing to read Serbian if it's encoded in an encoding that can be rendered with either Latin or Cyrillic letters (or digraphs); such a thing could easily be constructed and mapped to Unicode, thanks to the Croat-specific digraph compatibility characters present. That wouldn't make such an encoding a Good Thing in the wider world, though. -- He made the Legislature meet at one-horse John Cowan tank-towns out in the alfalfa belt, so that [EMAIL PROTECTED] hardly nobody could get there and most of http://www.reutershealth.com the leaders would stay home and let him go http://www.ccil.org/~cowan to work and do things as he pleased.--Mencken, Declaration of Independence
OT: John Cowan announces the "Unix Power Classic"
My apologies for this cross-post, but I just can't tell which of you will be interested in my latest effort: the Unix Power Classic, an evolving hacker-oriented version of the Tao Te Ching. See http://www.ccil.org/~cowan/upc . Please don't reply on-list, but directly to [EMAIL PROTECTED] Thanks. -- Schlingt dreifach einen Kreis vom dies!John Cowan <[EMAIL PROTECTED]> Schliesst euer Aug vor heiliger Schau, http://www.reutershealth.com Denn er genoss vom Honig-Tau, http://www.ccil.org/~cowan Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)
Re: Proposal to encode dominoes and other game symbols
Andrew C. West scripsit: > And perhaps Michael would be kind enough to prepare a proposal for > traffic signs if you asked nicely ;) Yes, but it will be only four lines long. :-) > H.7 Some criteria weaken the case for encoding A few of these criteria seem a bit flaky to me. > There is evidence that > -- the symbol is primarily used free-standing (traffic signs) > -- the notational system is not widely used on computers (dance notation, > traffic signs) So it looks like there are at least two reasons to shoot down traffic signs already. OTOH, lots of things are not widely used on computers precisely because they have no standard representation (minority scripts being the obvious case.) > -- the symbol is purely decorative This would seem to exclude dingbats altogether. > -- the identity of the symbol is usually ignored in processing Eh? > H.10 Perceived usefulness > The fact that a symbol merely seems to be useful or potentially > useful is precisely not a reason to code it. Demonstrated usage, or > demonstrated demand, on the other hand, does constitute a good reason > to encode the symbol. Amen. -- The Imperials are decadent, 300 pound John Cowan <[EMAIL PROTECTED]> free-range chickens (except they have http://www.reutershealth.com teeth, arms instead of wings andhttp://www.ccil.org/~cowan dinosaurlike tails).--Elyse Grasso
Phoenician character properties
If the Phoenician numbers work like Arabic digits (except for not being positional decimal, of course), shouldn't they have bidi type AN? Is strong RTLness really required for PHOENICIAN WORD SEPARATOR? If not, it can be unified with MIDDLE DOT. -- "Do I contradict myself? Very well then, I contradict myself. I am large, I contain multitudes." --Walt Whitman, Leaves of Grass John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan http://www.reutershealth.com
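For anyone who wants to check the properties in question, here is a minimal sketch that reads bidi classes out of whatever Unicode Character Database ships with the local Python. The helper name `bidi_report` is invented for illustration, and the values printed for the Phoenician characters depend on the UCD version in use, so none are asserted here.

```python
import unicodedata

def bidi_report(chars):
    """Map each character's Unicode name to its Bidi_Class value."""
    return {unicodedata.name(ch): unicodedata.bidirectional(ch) for ch in chars}

if __name__ == "__main__":
    sample = (
        "\U00010916"  # PHOENICIAN NUMBER ONE
        "\U0001091F"  # PHOENICIAN WORD SEPARATOR
        "\u00B7"      # MIDDLE DOT, the proposed unification target
        "\u0661"      # ARABIC-INDIC DIGIT ONE, a known AN character
    )
    for name, bidi in bidi_report(sample).items():
        print(f"{bidi:>3}  {name}")
```

MIDDLE DOT is a neutral (ON) and the Arabic-Indic digits are AN; comparing those with what the database reports for the two Phoenician characters shows how far the encoded properties are from the ones suggested above.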
Re: [BULK] - Re: Vertical BIDI
Mark Davis scripsit: > > As things now stand, Ogham must be wrapped in RLO...PDF brackets when > > mixed with vertical Han or Mongolian. > > Yes, that's true -- and I don't see any reason why people can't live with > that... Those are the kinds of reasons we have the explicit controls. Because horizontal vs. vertical should be under the control of a higher-level protocol such as CSS. In order for this to work properly, CSS has to "reach down inside" the bidi algorithm and jimmy it; or, to put it another way, the character-level representation of Ogham has to have knowledge of what overall directionality it is going to be embedded in. Both are Bad Things. -- There are three kinds of people in the world: those who can count, and those who can't. John Cowan http://www.reutershealth.com [EMAIL PROTECTED]
An annoying ambiguity about which nothing can be done now
The phrase "COMBINING DOUBLE" in a Unicode character name can mean either that the diacritical mark is doubled with respect to some other mark (DOUBLE ACUTE, DOUBLE VERTICAL LINE ABOVE, DOUBLE GRAVE, DOUBLE LOW LINE, DOUBLE OVERLINE, DOUBLE VERTICAL LINE BELOW, DOUBLE VERTICAL STROKE OVERLAY) or else that it extends over two characters (DOUBLE BREVE, DOUBLE MACRON, DOUBLE MACRON BELOW, DOUBLE TILDE, DOUBLE INVERTED BREVE, DOUBLE RIGHTWARDS ARROW BELOW). Of course MUSICAL SYMBOL COMBINING DOUBLE TONGUE is something else again. Thank you. I feel much better now. -- John Cowan www.ccil.org/~cowan www.reutershealth.com [EMAIL PROTECTED] There are books that are at once excellent and boring. Those that at once leap to the mind are Thoreau's Walden, Emerson's Essays, George Eliot's Adam Bede, and Landor's Dialogues. --Somerset Maugham
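The names may be ambiguous, but the canonical combining class is not: marks that span two base characters carry class 233 (double below) or 234 (double above). A minimal sketch that sorts the "COMBINING DOUBLE" names on that basis follows; the function name is made up for the example, and the lists it returns depend on the Unicode version bundled with the local Python.

```python
import sys
import unicodedata

# Canonical combining classes reserved for marks that span two base characters.
DOUBLE_BELOW, DOUBLE_ABOVE = 233, 234

def classify_combining_double():
    """Return (spanning, other) lists of names containing 'COMBINING DOUBLE'."""
    spanning, other = [], []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unnamed or unassigned code points
        if "COMBINING DOUBLE" not in name:
            continue
        if unicodedata.combining(ch) in (DOUBLE_BELOW, DOUBLE_ABOVE):
            spanning.append(name)   # extends over two characters
        else:
            other.append(name)      # doubled mark, or something else again
    return spanning, other

if __name__ == "__main__":
    spanning, other = classify_combining_double()
    print("Spans two characters:", *spanning, sep="\n  ")
    print("Everything else:", *other, sep="\n  ")
```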
Re: [BULK] - Re: Vertical BIDI
Mark Davis scripsit: > What the Bidi Algorithm says both of these is at: > > http://www.unicode.org/reports/tr9/#Vertical_Text However, it does not specify the treatment of Ogham embedded in TTB text, since Ogham is the only script with both a required horizontal direction (LTR) and a required vertical one (BTT). As things now stand, Ogham must be wrapped in RLO...PDF brackets when mixed with vertical Han or Mongolian. -- One art / There is John Cowan <[EMAIL PROTECTED]> No less / No more http://www.reutershealth.com All things / To do http://www.ccil.org/~cowan With sparks / Galore -- Douglas Hofstadter
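The RLO...PDF bracketing mentioned above amounts to surrounding the run with U+202E and U+202C. A minimal sketch, with `force_rtl` as a made-up helper name; it only builds the string and says nothing about how a given renderer will lay it out in vertical text.

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def force_rtl(text):
    """Bracket text with RLO...PDF so the bidi algorithm treats it as RTL."""
    return f"{RLO}{text}{PDF}"

if __name__ == "__main__":
    ogham = "\u1691\u168C"  # two Ogham letters, strong LTR by default
    wrapped = force_rtl(ogham)
    for ch in wrapped:
        print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch):>3} {unicodedata.name(ch)}")
```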
Re: Phoenician, Fraktur etc
Mark E. Shoulson scripsit: > Just for some more confusion to add, I note that with the distaste later > Pharisaic Judaism had for the Old Hebrew script, there comes a fairly > well-accepted, if unsupportable, thesis that the Law was actually > *originally* given in Square Hebrew ("Assyrian Script"), which was then > changed/forgotten when Israel sinned, and later still restored. See > http://www.sacred-texts.com/jud/t08/t0805.htm for some Talmudic > discussion of the matter. It's interesting, though, that the story given first is the true one: first Hebrew with PH script, then Aramaic with Square script; finally the Jews wind up with the law in Square Hebrew and the Samaritans with the law in Aramaic using PH script. -- John Cowan www.ccil.org/~cowan www.reutershealth.com [EMAIL PROTECTED] "'My young friend, if you do not now, immediately and instantly, pull as hard as ever you can, it is my opinion that your acquaintance in the large-pattern leather ulster' (and by this he meant the Crocodile) 'will jerk you into yonder limpid stream before you can say Jack Robinson.'" --the Bi-Coloured-Python-Rock-Snake
Re: Character Foldings
Mark Davis scripsit: > >LATIN CAPITAL LETTER O WITH STROKE => O + / That road quickly takes you to RFC 1345; I have some code for cracking 1345's format, and I should be able to prepare a mapping table fairly soon. -- My confusion is rapidly waxing John Cowan For XML Schema's too taxing:[EMAIL PROTECTED] I'd use DTDshttp://www.reutershealth.com If they had local trees -- http://www.ccil.org/~cowan I think I best switch to RELAX NG.
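The reason a separate table is needed at all is that LATIN CAPITAL LETTER O WITH STROKE has no canonical decomposition, so normalization alone never yields "O" plus something. A minimal sketch of such a folding, assuming a tiny hand-written table (`EXTRA_FOLDS` is hypothetical, with entries that follow the RFC 1345 mnemonics "O/" and "o/"; it is not the mapping table promised above):

```python
import unicodedata

# Hypothetical hand-written folds for letters that NFD leaves untouched.
EXTRA_FOLDS = {
    "\u00D8": "O/",  # LATIN CAPITAL LETTER O WITH STROKE
    "\u00F8": "o/",  # LATIN SMALL LETTER O WITH STROKE
}

def fold(text):
    """Apply the extra folds first, then canonical decomposition (NFD)."""
    out = []
    for ch in text:
        if ch in EXTRA_FOLDS:
            out.append(EXTRA_FOLDS[ch])
        else:
            out.append(unicodedata.normalize("NFD", ch))
    return "".join(out)

if __name__ == "__main__":
    print(repr(unicodedata.decomposition("\u00D8")))  # empty: nothing to decompose
    print(fold("\u00D8resund"))
```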
Re: Proposal to encode dominoes and other game symbols
Eric Muller scripsit: > >A suggestion for playing cards: why not including the "Tarots"? > >I mean in French the 4 "Cavaliers" figures, the 18 "Atouts", and the > >"Excuse" > >(which is not exactly a Joker); sorry I don't have their English names. > > > Make that 21 atouts (labeled "1" through "21"), for a total of 78 cards. > The "cavalier" is between the jack and the queen. Very popular game in > high school and college in my days. "Trumps" in English. I suggest that 21 trumps be encoded, but not named, because the correspondence of names to numbers is variable. Suitable names would be PLAYING CARD TRUMP I through PLAYING CARD TRUMP XXI. The Fool, the 22nd or un-numbered trump, is the direct ancestor of the Joker and should be unified with it. The fourth court card in each suit is called the Knight in English; these should also be encoded. This would call for 25 more cards. Documentation on the Tarot is extensive and readily available. -- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] 'Tis the Linux rebellion / Let coders take their place, The Linux-nationale / Shall Microsoft outpace, We can write better programs / Our CPUs won't stall, So raise the penguin banner of / The Linux-nationale.
Re: VISCII (was: Re: [BULK] - Re: MCW encoding of Hebrew)
Doug Ewell scripsit: > > So is [VIQR] a 7-bit encoding, or a scheme layered on top of ASCII? > > It's a scheme layered on top of ASCII > > And what is KOI-7? > > A true 7-bit encoding for Russian, in which Cyrillic letters (small and > capital respectively) were encoded in the ranges where ASCII has Latin > letters (capital and small respectively). Ah. And on what principle do you distinguish them? The IETF clearly treats them both as charsets, within its definitions. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan "You cannot enter here. Go back to the abyss prepared for you! Go back! Fall into the nothingness that awaits you and your Master. Go!" --Gandalf
Re: VISCII (was: Re: [BULK] - Re: MCW encoding of Hebrew)
Doug Ewell scripsit: > Truye^.n cu?a o^ng la` nhu+~ng bo^. nho+' ghi la.i mo^.t ca'ch so^'ng > ddo^.ng nhu+~ng sinh hoa.t dda(.c bie^.t cu?a no^ng tho^n Vie^.t Nam > ca'ch dda^y nu+?a the^' ky?. Ta ye^u me^'n da^n to^.c ta\. So is this a 7-bit encoding, or a scheme layered on top of ASCII? And what is KOI-7? -- "Clear? Huh! Why a four-year-old childJohn Cowan could understand this report. Run out [EMAIL PROTECTED] and find me a four-year-old child. I http://www.ccil.org/~cowan can't make head or tail out of it." http://www.reutershealth.com --Rufus T. Firefly on government reports
Re: Response to Everson Phoenician and why June 7?
James Kass scripsit: > Well, I don't think it would be cavalier in any sense to use a > transliteration font. Hardly antiquarian or throwback, either. > > But, I don't for a minute think it's the proper thing to do. > I think it would be silly and churlish. I'm more of a ceorl than a chevalier, myself. Strictly foot-bound peasant stock. > those who wish to do so aren't bound by my opinions, eh? The widespread use (as opposed to the mere existence) of a Phoenician encoding in Unicode imposes costs on at least some Semiticists that they do not wish to pay, at least without some assistance from Unicode. Hence my desire to have Phoenician and Hebrew collate together at the first level (more for searching than for sorting). -- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] The Penguin shall hunt and devour all that is crufty, gnarly and bogacious; all code which wriggles like spaghetti, or is infested with blighting creatures, or is bound by grave and perilous Licences shall it capture. And in capturing shall it replicate, and in replicating shall it document, and in documentation shall it bring freedom, serenity and most cool froodiness to the earth and all who code therein. --Gospel of Tux
Re: Response to Everson Phoenician and why June 7?
Dean Snyder scripsit: > It would be like testing readers of Roman German who had > never read Fraktur - they wouldn't recognize it as a font change either > (which it is, of course, in Unicode). I see the words "The New York Times" in Fraktur (more or less) every day. It's obviously a font variant of Latin. -- Business before pleasure, if not too bloomering long before. --Nicholas van Rijn John Cowan <[EMAIL PROTECTED]> http://www.ccil.org/~cowan http://www.reutershealth.com
Re: ISO 15924 draft fixes
Philippe Verdy scripsit: > > Please go to Langues'O for this commentary. As I wrote, you will be > > probably answered with the historical context. > > C'est quoi Langues'O ? Où est-ce ? Please forgive me for intruding into an internal francophone matter, but whenever I see "Langues'O", my mind insists on correcting it into "Langues d'O", as in "Histoire d'O". Not that I read French. -- John Cowan [EMAIL PROTECTED]http://www.reutershealth.com "Not to know The Smiths is not to know K.X.U." --K.X.U.
Re: Response to Everson Phoenician and why June 7?
Kenneth Whistler scripsit: > The question is rather, given the fundamental nature of the > Unicode Standard as enabling text processing for modern > software, it is cost-effective and *reasonable* to provide > a Unicode encoding for one particular script or another, > unencoded to date, so as to maximize the chances that it > will be handled more easily by modern software in the global > infrastructure and to minimize the costs associated with > doing so. These words (and indeed your entire posting) deserve to be written up in letters of gold somewhere. -- LEAR: Dost thou call me fool, boy? John Cowan FOOL: All thy other titles http://www.ccil.org/~cowan thou hast given away: [EMAIL PROTECTED] That thou wast born with. http://www.reutershealth.com
Re: Vertical BIDI
Andrew C. West scripsit: > The only thing that is certain is that Ogham must be rendered BTT in > vertical contexts. For Ogham text in isolation this is fairly easy to > accomplish by simple rotation, and one could expect "writing-mode > : bt-rl" or "writing-mode : bt-lr" to accomplish this in a CSS > stylesheet. Whether the columns should run LTR or RTL across the page > is another question, although LTR would be simplest to implement as > it would only involve rotating a whole block of horizontal LTR Ogham > text 90 degrees anticlockwise. At any rate, vertical presentation is > a matter for a higher protocol, and not a Unicode matter. I think it's clear by now that bt-lr is the Right Thing. (A great pity that the Irish monks didn't record horizontal Ogham RTL! If you are standing in front of an Ogham-inscribed archway, the curve of the text does pass from your right side to your left side (and the same for a standing stone if you in imagination flatten out the sides), and the monks must have had *some* familiarity with Hebrew or Arabic.) > However, Ogham text embedded in Mongolian may be a different matter. If > a plain text editor renders everything horizontally, as most do, then > both Mongolian and Ogham should be rendered LTR thus mongolian>, but if you then select vertical presentation (assuming > your text editor has this option) Mongolian should be rendered TTB and > Ogham BTT thus . I still have no idea as > to how this should be achieved. My "hack" of using a custom rotated > Ogham font and RLO/PDF codes would achieve the desired result for > vertical presentation, but would make the Ogham text RTL for horizontal > presentation, which is apparently unacceptable. But what alternatives > are there ? To introduce a concept of bidi override into stylesheet languages. You need something like this anyway to handle the case of lr Latin with embedded Han, where the Latin reads BTT and the Han reads TTB. 
Fundamentally, vertical scripts like Han and Mongolian and Ogham have an essential vertical directionality and a preferred horizontal one (but they can sometimes tolerate the other direction: RTL Han is not unknown). Horizontal scripts have an essential horizontal directionality and may or may not have a preferred vertical one. -- Long-short-short, long-short-short / Dactyls in dimeter, Verse form with choriambs / (Masculine rhyme): [EMAIL PROTECTED] One sentence (two stanzas) / Hexasyllabically http://www.reutershealth.com Challenges poets who / Don't have the time. --robison who's at texas dot net
Re: Vertical BIDI
Philippe Verdy scripsit: > > In fact no; both Mongolian (or Manchu, which is unified with it in > > Unicode) and Chinese are written TTB. > > Then, why did you say that: > > > What's uncertain is whether a lr or a rl progression is favored, > > given the paucity of evidence. Michael favors lr progression. > > There is no question that the text is read BTT. That statement refers to Ogham, not Mongolian! Ogham carved on stone is read up one side of the stone, then (if necessary) across the top of the stone, then (if necessary) down the other side of the stone. Now maybe it's just a mistake to assimilate this scheme to any kind of two-dimensional layout, since all known instances of Ogham on manuscript are ordinary horizontal L2R, like Latin (with which it is most often mixed). The difficulty arises when Ogham is mixed with vertical Han or with Mongolian, since once the basic directionality becomes vertical, the tendency to read the Ogham BTT will become automatic. This is analogous to the problem that fantasai has pointed out with Latin script written in lr progression when Han gets mixed in: the normal reading direction of lr-Latin is BTT, but any Han included will automatically be read TTB, corrupting it. *sigh* One of my favorite lines in the Unicode Standard reads: "There simply is no traditional Japanese method of typesetting Devanagari." -- John Cowan www.ccil.org/~cowan www.reutershealth.com [EMAIL PROTECTED] There are books that are at once excellent and boring. Those that at once leap to the mind are Thoreau's Walden, Emerson's Essays, George Eliot's Adam Bede, and Landor's Dialogues. --Somerset Maugham
Re: Vertical BIDI
Philippe Verdy scripsit: > This creates an interesting problem: Put in the same sentence Han > (Chinese) and Mongolian words in a vertical layout (I don't think this > is unlikely, as Mongolian is also spoken in China, and there's also > a Chinese community in Mongolia). So Chinese ideographs will be laid > out vertically from top to bottom (but not rotated, except for a few > characters like ideographic punctuation marks or symbols), and Mongolian > will be laid out from bottom to top in their normal stack orientation. In fact no; both Mongolian (or Manchu, which is unified with it in Unicode) and Chinese are written TTB. When Mongolian stands alone, the columns progress from left to right, but when it's mixed with Han, the columns progress from right to left, as is the case with Chinese alone. Presumably this is about like writing a Latin-script language with upright glyphs and LTR, but progressing from the bottom of the page to the top: annoying but legible. > Now admit that you want to present it horizontally: Han ideographs will > not be rotated but will flow on rows from left to right. Suppose you > have performed the Bidi processing according to the previous vertical > presentation, then Mongolian stacks will flow from right to left > (but unlike Han ideographs, they will need to be rotated...) You don't. Horizontal Mongolian runs left to right, which means that with respect to its Aramaic ancestor the glyphs are upside down. Now when mixing Ogham vertical text with other vertical scripts, you do indeed need to use RLO ... PDF to force it into bidirectional behavior, but it's the only such case. There seem to be two different alternatives for RTL horizontal alphabets in a vertical context, depending on which way the glyphs are rotated. -- But you, Wormtongue, you have done what you could for your true master. Some reward you have earned at least. Yet Saruman is apt to overlook his bargains. 
I should advise you to go quickly and remind him, lest he forget your faithful service. --Gandalf John Cowan <[EMAIL PROTECTED]>
Re: ISO-15924 script nodes and UAX#24 script IDs
Antoine Leca scripsit: > OTOH, it appears to me (feel free to contradict me, and also to > point me the epoch when these things did change) that English habits > now is to follow the native name and the transliteration rules. True, although diacritics are still sometimes dropped not on principle but as a concession (no longer necessary, IMHO) to typographical constraints. Exactly when this began to be so is vague: sometime in the 20th century, or perhaps the very last of the 19th. Certainly at the beginning of the 19th century anglophones were still respelling foreign words. > A good example I found recently is the name of Cervantes' main work, > which short name is "Don Quixote" in English, the same as it was > in (original) Castilian, while at the same time it was adapted in > French as "Don Quichotte" (same pronunciation as original), and > similarly in today's Castilian "Don Quijote" (with subsequent change > in pronunciation.) I do not know how English natives will pronounce > it, however. Most people say [kihote], since we do not have the Spanish "j", IPA [x]. (Of course the [o] and [e] vowels become diphthongs, as in most varieties of English.) I personally made a mild nuisance of myself in the class where I studied it by insisting on saying [kiSote]. The derived adjective "quixotic", however, is pronounced in native fashion [kwIksOtIk]. The English poet Byron did not hesitate in his 1821 poem about Don Juan (Tenorio, that is) to rhyme the hero's name with "new one" and "true one" in the very first stanza, showing that the pronunciation [EMAIL PROTECTED] was normal in his time. 
-- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan Rather than making ill-conceived suggestions for improvement based on uninformed guesses about established conventions in a field of study with which familiarity is limited, it is sometimes better to stick to merely observing the usage and listening to the explanations offered, inserting only questions as needed to fill in gaps in understanding. --Peter Constable
Re: Vertical BIDI
fantasai scripsit: > If another style rule changes the block progression to rl, what should > happen to the Ogham? Should it now go top to bottom? It should not. That's what makes Ogham different from standard horizontal scripts -- it does have a preferred vertical orientation, and because turning it upside-down generates different *characters*, you can't violate that. > >Also, it's not just punctuation marks that need to get vertical glyphs > >in vertical formats, it's also things like BOPOMOFO LETTER I. > > Are you sure you're not confusing that with the KATAKANA-HIRAGANA > PROLONGED SOUND MARK? Not sure, but I had understood that bopomofo i (which is just one stroke) was rotated when vertical. -- My corporate data's a mess! John Cowan It's all semi-structured, no less. http://www.ccil.org/~cowan But I'll be carefree[EMAIL PROTECTED] Using XSLT http://www.reutershealth.com On an XML DBMS.
Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))
Andrew C. West scripsit: > Thus, if "tb-lr" were supported, your browser would display the > following HTML line as vertical Mongolian with embedded Ogham reading > top-to-bottom, but in a plain text editor, the Mongolian and Ogham > would both read LTR, and everyone would be happy : I don't know about that. I wouldn't be too happy trying to read English with the Latin letters laid out bt-rl and lying on their left sides to boot. On paper is one thing, but on a non-rotatable screen? I don't think so. -- "We are lost, lost. No name, no business, no Precious, nothing. Only empty. Only hungry: yes, we are hungry. A few little fishes, nassty bony little fishes, for a poor creature, and they say death. So wise they are; so just, so very just." --Gollum[EMAIL PROTECTED] www.ccil.org/~cowan
Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))
Chris Jacobs scripsit: > So if people pronounce it as > > twenty-one > esriem we achad > > then they probably indeed write the digit 2 first. Indeed, but the difficulty is that various Arabic colloquials don't agree on the order of pronouncing numbers -- and modern standard Arabic uses the least-significant-digit first style: one and twenty and three hundred and -- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] Lope de Vega: "It wonders me I can speak at all. Some caitiff rogue did rudely yerk me on the knob, wherefrom my wits still wander." An Englishman: "Ay, a filchman to the nab betimes 'll leave a man crank for a spell." --Harry Turtledove, Ruled Britannia
Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))
Jony Rosenne scripsit: > However, in Hebrew and Arabic, numbers are written left to right and so are > Latin and other LTR script quotations. So RTL really means mixed direction, > and the bidi algorithm is there to handle it automatically with little user > intervention. BTW, Peter Daniels told me viva voce that arabophones, like persophones and hebraeophones, do (hand)write numbers LTR starting with the most significant digit. But we still have no confirmation from a native arabophone. And if someone could explain the full significance of the Arabic-Indic vs. the Eastern Arabic-Indic digits (other than glyph shape), I'd appreciate it. I know that the EAI digits work just like the European ones, whereas the AI digits work differently, but what is the effective difference? > All of this is completely irrelevant to boustrophedon and vertical scripts. > These are presentation issues that have no need for Unicode support. Vertical Ogham does, but forced override is sufficient -- it doesn't need an *implicit* bidi algorithm. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com http://www.ccil.org/~cowan Humpty Dump Dublin squeaks through his norse Humpty Dump Dublin hath a horrible vorse But for all his kinks English / And his irismanx brogues Humpty Dump Dublin's grandada of all rogues. --Cousin James
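At the property level, at least, the difference between the two digit sets is visible in the bidi class: the Arabic-Indic digits are AN (Arabic Number), while the Eastern (Extended) Arabic-Indic digits share EN (European Number) with the ASCII digits. A minimal sketch that prints this side by side; `digit_profile` is a name invented for the example, and it does not attempt to show how the bidi algorithm then treats AN and EN runs differently.

```python
import unicodedata

def digit_profile(ch):
    """Return (name, numeric digit value, Bidi_Class) for a digit character."""
    return (unicodedata.name(ch), unicodedata.digit(ch), unicodedata.bidirectional(ch))

if __name__ == "__main__":
    for ch in ("1", "\u0661", "\u06F1"):
        name, value, bidi = digit_profile(ch)
        print(f"U+{ord(ch):04X} value={value} bidi={bidi:<2} {name}")
```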
Re: Interleaved collation of related scripts (was: Phoenician)
Peter Kirk scripsit: > >I would have just as many objections to doing that as I would with > >unifying it with Hebrew. Users don't expect this kind of interfiling > >when looking things up in ordered lists. Interfiling of scripts > >impedes legibility. > > Well, I see the point. But presumably the only people who would collate > a text containing a mixture of Hebrew and Phoenician, for example, are > those who know and understand both scripts. For anyone else this is a > matter of garbage in, garbage out. So it should be up to these users to > decide whether the legibility concern, which is a real one, is more > important than their otherwise expressed preference for interfiling. In addition, it's important to always remember that "collation" is a cover term for both sorting *and* searching. Collating Hebrew with "Phoenician" at the first level means that a search using Hebrew letters will find "Phoenician" text as well. (I am using horror quotes to remind people that Unicode "Phoenician" includes many non-Punic 22CWSAs, particularly Palaeo-Hebrew.) If indeed Serbs prefer collation equivalence between Cyrillic and Latin (which can only be a tailored preference, of course; in general we don't want to do that), this means not only that they will see the two interfiled in a sorted list, but also that searching for a Serbian word in Cyrillic will find it in Latin and vice versa. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan Female celebrity stalker, on a hot morning in Cairo: "Imagine, Colonel Lawrence, ninety-two already!" El Auruns's reply: "Many happy returns of the day!"
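To make the searching half of "collation" concrete, here is a minimal sketch of a first-level match key in which each Phoenician letter and the corresponding Hebrew letter (including final forms) compare equal. It is a toy stand-in for primary-level equivalence, not the Unicode Collation Algorithm, and the names `primary_key` and `matches` are invented for the example.

```python
# The 22 non-final Hebrew letters, alef through tav, in alphabetical order.
HEBREW_BASE = [
    0x05D0, 0x05D1, 0x05D2, 0x05D3, 0x05D4, 0x05D5, 0x05D6, 0x05D7,
    0x05D8, 0x05D9, 0x05DB, 0x05DC, 0x05DE, 0x05E0, 0x05E1, 0x05E2,
    0x05E4, 0x05E6, 0x05E7, 0x05E8, 0x05E9, 0x05EA,
]

# Phoenician letters occupy U+10900..U+10915 in the same alphabetical order.
FOLD = {0x10900 + i: cp for i, cp in enumerate(HEBREW_BASE)}

# Hebrew final forms fold to their non-final counterparts.
FOLD.update({0x05DA: 0x05DB, 0x05DD: 0x05DE, 0x05DF: 0x05E0,
             0x05E3: 0x05E4, 0x05E5: 0x05E6})

def primary_key(text):
    """Fold Phoenician letters and Hebrew finals onto base Hebrew letters."""
    return text.translate(FOLD)

def matches(haystack, needle):
    """True if needle occurs in haystack once script differences are ignored."""
    return primary_key(needle) in primary_key(haystack)

if __name__ == "__main__":
    phoenician_mlk = "\U0001090C\U0001090B\U0001090A"  # mem, lamd, kaf
    hebrew_mlk = "\u05DE\u05DC\u05DA"                  # mem, lamed, final kaf
    print(matches(phoenician_mlk, hebrew_mlk))
```

A real implementation would express the same equivalence as tailored primary weights, so that sorting and searching stay consistent with each other.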
Re: interleaved ordering (was RE: Phoenician)
Philippe Verdy scripsit: > Full collation between Phoenician and Hebrew is not really needed: > the texts are part of separate corpus, and the original documents > do not mix these scripts in the same words. Remember that "Phoenician" in this context includes Palaeo-Hebrew, and we *have* seen evidence that this script is mixed with Square in the same text, though not in the same word. -- Evolutionary psychology is the theory John Cowan that men are nothing but horn-dogs, http://www.ccil.org/~cowan and that women only want them for their money. http://www.reutershealth.com --Susan McCarthy (adapted) [EMAIL PROTECTED]
Re: OT [was TR35]
John Hudson scripsit: > Jony Rosenne wrote: > > >Mozilla's main value is for non-Windows platforms. > > And for people who are unimpressed by Outlook's security track record. The main reason I spoke of the Outlook addiction is that (at least as of the last time I looked at the question) it is practically impossible to get one's data (saved emails, saved calendar entries, etc.) out of the Outlook database in usable form. In particular, emails with attachments are practically beyond reconstruction. Mozilla-based email systems use plain mbox/Eudora format, which at least maintains the emails in a way that's easy to understand. Me, I use mutt. GUI-based mail clients are just too slow. -- John Cowan [EMAIL PROTECTED] http://www.ccil.org/~cowan Most languages are dramatically underdescribed, and at least one is dramatically overdescribed. Still other languages are simultaneously overdescribed and underdescribed. Welsh pertains to the third category. --Alan King
Re: Phoenician
Christopher Fynn scripsit: > OTOH applications that generate collated lists should ideally provide a > straightforward means of applying special tailoring tables. "Should ideally" are the operative words; in most cases, we're lucky if we get default collating behavior rather than UTF-16 or UTF-8/UTF-32 binary sorting. That's why it's important what the content of the default collation is, and that it get things right for at least a large subset of users. -- Overhead, without any fuss, the stars were going out. --Arthur C. Clarke, "The Nine Billion Names of God" John Cowan <[EMAIL PROTECTED]>
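A minimal sketch of why "UTF-16 binary" and "UTF-8/UTF-32 binary" are named separately above: the two orders disagree whenever a supplementary character is compared with a BMP character above the surrogate range. The two sample characters are arbitrary choices for illustration, and the variable names are made up for this sketch.

```python
# Two sample strings (arbitrary illustration choices):
# U+FF5E is a BMP character above the surrogate range;
# U+10900 is a supplementary-plane character.
bmp_high = "\uFF5E"
supplementary = "\U00010900"
samples = [bmp_high, supplementary]

# UTF-16 binary order: big-endian bytes compare the same way as 16-bit code units.
by_utf16 = sorted(samples, key=lambda s: s.encode("utf-16-be"))

# UTF-8 binary order: bytewise comparison of the UTF-8 form.
by_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))

# UTF-32 binary order: comparison by code point.
by_utf32 = sorted(samples, key=lambda s: [ord(ch) for ch in s])

# In UTF-16 the supplementary character starts with a surrogate (0xD802),
# which sorts below 0xFF5E, so it comes first there and last elsewhere.
```

UTF-8 and UTF-32 binary order both agree with code point order; UTF-16 binary order does not, so an application that falls back to binary comparison gets a result that depends on its internal encoding form.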
Re: OT [was TR35]
Jony Rosenne scripsit: > When I travel, I change the time rather than the time zone, because > changing the time zone causes Outlook to mess up my calendar. This > causes my e-mails to have a wrong time stamp. Is there any solution > to this? AFAIK the only cure is to break the Outlook addiction. -- I suggest you call for help, John Cowan or learn the difficult art of mud-breathing. [EMAIL PROTECTED] --Great-Souled Sam http://www.ccil.org/~cowan
Everson-bashing (was: Phoenician)
Peter Kirk scripsit: > But have the others agreed with his judgments because they are convinced > of their correctness? Or is it more that the others have trusted the > judgments of the one they consider to be an expert, and have either not > dared to stand up to him or have simply been unqualified to do so? This is laughable. > It amazes me that all of the existing scripts have apparently been encoded > without any properly documented justification apart from one expert's > unchallenged judgments. It would be amazing if it were true, but of course it's absolutely false. > And these two cases are hardly a good advertisement for the expert's > reputation. The Coptic/Greek unification proved to be ill-advised and is > being undone. As for the unified W and Q, well, I guess that if the > Kurds and others who use these letters in Cyrillic knew how this > decision would mean that their alphabet will never be sorted correctly > (unless they get round to tailoring their collations), they would make a > strongly argued case for disunification. Nobody writes Kurdish in Cyrillic any more: it's a historic use of the script only. In any event, Michael had *nothing* to do with those unifications. He has consistently pressed for disunification (rightly, IMHO). > Well, perhaps the expert can > feel how much his fingers have been burned by over-unification and so is > now pressing for everything to be disunified. Nonsense, and insulting nonsense to boot. Michael has never pressed for either total unification or total disunification, because both positions are absurd, and his position is never absurd. (I may disagree with it from time to time, and I am willing to press him for reasons, but I *always* respect his point of view.) This verbal sniping on a subject (the history of character encoding) you know nothing about is beneath you. Try and do better. > And then there is the matter of CJK unification, which I gather is still > rather contentious. Only among the invincibly ignorant.
-- John Cowan <[EMAIL PROTECTED]> http://www.ccil.org/~cowan "One time I called in to the central system and started working on a big thick 'sed' and 'awk' heavy duty data bashing script. One of the geologists came by, looked over my shoulder and said 'Oh, that happens to me too. Try hanging up and phoning in again.'" --Beverly Erlebacher
Re: Phoenician
E. Keown scripsit: > I guess this is a flame, right? > but what on earth does it mean? > > > Hardly. If the rest of you hadn't agreed with his > > judgments most of the time, the Roadmap might look > > quite different. It's more like Potter > > Stewart on pornography. > > Who's Potter Stewart? (I don't own a TV). Elaine *lol* A former Associate Justice of the U.S. Supreme Court, who memorably declared in a 1964 concurring opinion that he could not define pornography, but he knew it when he saw it (and the movie in question wasn't it). -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan "It's the old, old story. Droid meets droid. Droid becomes chameleon. Droid loses chameleon, chameleon becomes blob, droid gets blob back again. It's a classic tale." --Kryten, Red Dwarf
Re: Phoenician
Mark Davis scripsit: > - But I'm good at it, because invariably when I say it's a tree, > I agree with myself. Hardly. If the rest of you hadn't agreed with his judgments most of the time, the Roadmap might look quite different. It's more like Potter Stewart on pornography. -- John Cowan www.reutershealth.com www.ccil.org/~cowan [EMAIL PROTECTED] The Penguin shall hunt and devour all that is crufty, gnarly and bogacious; all code which wriggles like spaghetti, or is infested with blighting creatures, or is bound by grave and perilous Licences shall it capture. And in capturing shall it replicate, and in replicating shall it document, and in documentation shall it bring freedom, serenity and most cool froodiness to the earth and all who code therein. --Gospel of Tux
Re: Default Ordering (was: Re: Phoenician)
Kenneth Whistler scripsit: > (Encoded as distinct scripts, by the > way, despite their clear and evident historic relationship > to each other, and despite the fact that Japanese can obviously > read both of them with great facility -- if you guys want to > take that particular bone in your mouth and chew on it for > awhile... consider Kana the 48CEAS *hehe*) Of course they would have to be. But if the Japanese had ditched their kanji and wrote mostly in hiragana, with katakana used very rarely -- say, about as frequent in running text as italicized foreign words in Latin-script running text -- they might not have bothered to encode them separately. > If it turns out to make the most sense for a default table > to have 22CWSA scripts (as John puts it) sort with interleaved > primary weights, it is technically feasible to generate a > table that way. (Although not for Hebrew versus Arabic versus > Syriac, which are treated distinctly for primary weights now.) Oh, I quite agree. Arabic and Syriac are out of the picture here: too many consonants, too different. > It isn't a foregone conclusion what the UTC and WG2 will do on > this issue -- it, like the encoding of the Phoenician > (~ Old Canaanite, ~ Old West Semitic) script itself, is a > matter for technical debate and decision. Which we are now having the preliminary part of. -- You escaped them by the will-death John Cowan and the Way of the Black Wheel. [EMAIL PROTECTED] I could not. --Great-Souled Sam http://www.ccil.org/~cowan
Re: Fraser (was RE: Public Review Issues Updated)
Kenneth Whistler scripsit: > Fraser is to Latin approximately as Tangut is to Han. It is what > you get when you create a de novo script for a completely > different language, but you have a very limited notion of > what a "letter" is supposed to look like. I'm sold. Separate script it is. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan If a traveler were informed that such a man [as Lord John Russell] was leader of the House of Commons, he may well begin to comprehend how the Egyptians worshiped an insect. --Benjamin Disraeli