Re: Internal Representation of Unicode
[EMAIL PROTECTED] scripsit:

> First they'd want numeric value properties added to the Hebrew and
> Greek letters, then when they came to do the same for the Latin letters
> the ensuing flamewar would bring the whole effort to a standstill.

Numeric values for Hebrew, Greek, and Cyrillic make a lot of sense, actually.

--
John Cowan  [EMAIL PROTECTED]  http://www.ccil.org/~cowan
I am expressing my opinion. When my honorable and gallant friend is called, he will express his opinion. This is the process which we call Debate. --Winston Churchill
RE: Internal Representation of Unicode
Yeah, but dude, wasting time on stupid ideas goes with the territory if you happen to be a creative genius. Some of your ideas won't work. Others will be magnificent. I'd put good money on the notion that if Newton had been prevented from pursuing astrology or numerology, this restriction would have had serious negative consequences for his genius, and then maybe we wouldn't have calculus or the laws of motion either.

Creativity is all about playing with your mind, not about playing in a sandbox. Anyone who doesn't understand that is doomed to constantly cripple the expression of genius, to the detriment of society as a whole.

I can see why someone would want to make a console or terminal emulator work with Unicode. I've messed around myself with ideas to make it work (haven't come up with anything yet though). I say, go for it Johann. If it turns out to have been a good idea, people will use it. If not, you will have learned a great deal. It's definitely a no-lose situation.

Jill

> -----Original Message-----
> Isaac Newton spent an unconscionable amount of time, by our standards,
> messing about with astrology and numerology -- far more than he ever put
> into physics or calculus. The "standardization" of science since his day
> has helped reduce such effects.
Re: Internal Representation of Unicode
> > At 11:15 AM 9/30/03 -0400, John Cowan wrote:
> >> Isaac Newton spent an unconscionable amount of time, by our
> >> standards, messing about with astrology and numerology
> >
> > One of the aspects of character encoding and standardization that
> > seems to have an unholy fascination for people is its numerical
> > aspect. It starts with the catalog number for 10646, which was
> > deliberately jiggered to incorporate the number 646, which is the
> > catalog number for the 7-bit standards. It continues with the desire
> > to see certain characters at specific code locations (for example the
> > byte order mark) and continues with the never-ending stream of
> > (re-)encoding forms.
> >
> > It's just human nature, I guess.
>
> Maybe we should add something to the submission form: "Has this
> proposal been approved by a numerologist?"

First they'd want numeric value properties added to the Hebrew and Greek letters, then when they came to do the same for the Latin letters the ensuing flamewar would bring the whole effort to a standstill.

Still, there are good reasons for the BOM being where it is...
Re: Internal Representation of Unicode
On Sep 30, 2003, at 12:01 PM, Asmus Freytag wrote:

> At 11:15 AM 9/30/03 -0400, John Cowan wrote:
>> Isaac Newton spent an unconscionable amount of time, by our
>> standards, messing about with astrology and numerology
>
> One of the aspects of character encoding and standardization that
> seems to have an unholy fascination for people is its numerical
> aspect. It starts with the catalog number for 10646, which was
> deliberately jiggered to incorporate the number 646, which is the
> catalog number for the 7-bit standards. It continues with the desire
> to see certain characters at specific code locations (for example the
> byte order mark) and continues with the never-ending stream of
> (re-)encoding forms.
>
> It's just human nature, I guess.

Maybe we should add something to the submission form: "Has this proposal been approved by a numerologist?"

John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage..mac.com/jhjenkins/
Re: Internal Representation of Unicode
At 11:15 AM 9/30/03 -0400, John Cowan wrote:

> Isaac Newton spent an unconscionable amount of time, by our standards,
> messing about with astrology and numerology

One of the aspects of character encoding and standardization that seems to have an unholy fascination for people is its numerical aspect. It starts with the catalog number for 10646, which was deliberately jiggered to incorporate the number 646, which is the catalog number for the 7-bit standards. It continues with the desire to see certain characters at specific code locations (for example the byte order mark) and continues with the never-ending stream of (re-)encoding forms.

It's just human nature, I guess.

A./
Re: Internal Representation of Unicode
Jill Ramonsky scripsit:

> Ludwig, this Pastoral Symphony of yours all seems to me like something
> of a pointless exercise.
> And Albert, this "Theory of Relativity" of yours all seems to me like
> something of a pointless exercise.
>
> Never discourage someone else's creativity.

The whole point of standardization is to *redirect* (not discourage) people's creativity into useful channels. Isaac Newton spent an unconscionable amount of time, by our standards, messing about with astrology and numerology -- far more than he ever put into physics or calculus. The "standardization" of science since his day has helped reduce such effects.

--
John Cowan  http://www.ccil.org/~cowan  [EMAIL PROTECTED]
Please leave your values at the front desk. --sign in Paris hotel
Check your assumptions. In fact, check your assumptions at the door. --Cordelia Vorkosigan
RE: Internal Representation of Unicode
Ludwig, this Pastoral Symphony of yours all seems to me like something of a pointless exercise.

And Albert, this "Theory of Relativity" of yours all seems to me like something of a pointless exercise.

Never discourage someone else's creativity.

Jill

> -----Original Message-----
> From: Rick McGowan [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 26, 2003 5:05 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Internal Representation of Unicode
>
> This all seems to me like something of a pointless exercise.
Re: Internal Representation of Unicode
James Kass wrote on 09/26/2003 12:03:42 AM: > Peter Constable (IIRC) reported on this list a while ago that there was > a Latin-based writing system used for an indigenous South American > language which stacks up to three marks above. Good memory, James! The language is Ticuna. Peter
Re: Internal Representation of Unicode
myrkraverk...sourceforge wrote:

> In a plain text environment, there is often a need to encode more than
> just the plain character.
...
> Since I'm using 64 bits, I call it Excessive Memory Usage Encoding, or
> EMUE.
...
> I thought of dividing the 64 bit code space into 32 variably wide
> plains, one for control characters, one for latin characters, one for
> han characters, and so on;

This all seems to me like something of a pointless exercise. Or maybe you're not making clear who your intended audience is and what problems you're trying to solve.

Decent libraries exist that already do nice things with strings having attributes. And that, in my opinion, is a better model than bit-hacking in a 64-bit space with vague implementation-defined attributes that change depending on the "script" of a character. Such "attributed strings" are easy to work with and provide a much higher-level model than this.

You might want to check out Apple's Cocoa environment, particularly the definitions of the attributed string classes. For example...

http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Java/Classes/NSAttributedString.html

or even the intro:

http://developer.apple.com/documentation/Cocoa/Conceptual/AttributedStrings/index.html

I'm sure there are libraries with similar capabilities for storing characters + attributes in Java and other languages; I'm just not familiar with them. Maybe some of the developers can chime in with their favorite attributed string libraries. Even if you don't use one, you might find the attributed string model educational.

(All of the above of course reflects only my personal opinion.)

Rick
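For readers who have not met the attributed string model Rick mentions, here is a minimal sketch of the idea in Python. It is only an illustration under assumptions: the class `AttributedString` and its methods `add_attribute` and `attributes_at` are hypothetical names invented for this example, not the NSAttributedString API. The point it shows is that the text stays an ordinary Unicode string while attributes live in separate ranges.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedString:
    """Plain text plus attribute runs stored outside the character data.

    Hypothetical illustration only; this is not the NSAttributedString API.
    """
    text: str
    # Each run is (start, end, attributes), with end exclusive.
    runs: list = field(default_factory=list)

    def add_attribute(self, start, end, name, value):
        """Attach one attribute to the half-open range [start, end)."""
        if not 0 <= start <= end <= len(self.text):
            raise ValueError("range outside the string")
        self.runs.append((start, end, {name: value}))

    def attributes_at(self, index):
        """Merge every run covering index; later runs win on conflicts."""
        merged = {}
        for start, end, attrs in self.runs:
            if start <= index < end:
                merged.update(attrs)
        return merged


s = AttributedString("error: disk full")
s.add_attribute(0, 5, "color", "red")
s.add_attribute(0, 5, "bold", True)
print(s.attributes_at(2))   # attributes on the 'r' of "error"
print(s.attributes_at(10))  # a position with no attributes
```

Because the attributes are kept apart from the characters, comparing `s.text` with another string is an ordinary string comparison, whatever styling is attached.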
RE: Internal Representation of Unicode
[EMAIL PROTECTED] wrote:

> In a plain text environment, there is often a need to encode more than
> just the plain character. A console, or terminal emulator, is such an
> environment. Therefore I propose the following as a technical report
> for internal encoding of unicode characters; with one goal in mind:
> character equalence is binary equalence.

I guess you meant "equivalence".

Q1: But what are "character equivalence" and "binary equivalence", and why did you choose them as your goals?

> I thought of dividing the 64 bit code space into 32 variably wide
> plains,

Q2: What are these "plains" for? Why are there 32 of them?

> one for control characters, one for latin characters, one for
> han characters,

Q3: Why do you want to treat Latin characters and Han characters differently? There is nothing special about Latin or Han characters in Unicode: they are just 2 of the about 50 scripts currently supported in Unicode. (see http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://www.unicode.org/Public/UNIDATA/Scripts.txt)

Q4: And how do you plan to distinguish them? Both Latin and Han characters are scattered all over the Unicode space, so you need to check many ranges to determine which character belongs to which category.

Q5: And what about all the characters that are neither Latin nor Han?

> and so on; using 5 bits and the next 3 fixed to zero
> (for future expansion and alignment to an octet).
> I call plain 0 control characters and won't discuss it further.

Q6: Why do control characters get special handling?

Q7: Don't control characters have properties attached like any other characters? One example of a property that could be useful to attach to a control character is directionality. E.g., a TAB is always a TAB but, after it has passed through the Bidirectional Algorithm, its directionality can be resolved to be either LTR or RTL.
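Two of the points above (Q4 and Q7) can be checked with Python's standard `unicodedata` module. This is a rough sketch, not a proper script lookup: the helper `is_latin_letter` is a hypothetical name, and it approximates "Latin" by looking for the word LATIN in the character name instead of parsing Scripts.txt.

```python
import unicodedata


def is_latin_letter(ch):
    """Approximate test: does the character name mention LATIN?

    Assumption for illustration; a real implementation would use Scripts.txt.
    """
    return "LATIN" in unicodedata.name(ch, "")


# Latin letters turn up in widely separated ranges; U+4E2D is a Han ideograph.
samples = ["A", "\u00E9", "\u1E00", "\uFF5A", "\u4E2D"]
for ch in samples:
    print(f"U+{ord(ch):04X}", is_latin_letter(ch))

# TAB carries a bidirectional class of its own (segment separator),
# so control characters do have properties worth tracking.
print(unicodedata.bidirectional("\t"))
```

The sample code points run from U+0041 up to U+FF5A, which is why a single contiguous "Latin" range does not exist.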
> Plain 1, I had intended for latin characters with the following
> encoding method in mind:
>
> bits   63..59  58..56  55..40  39..32  31..24  23..16  15..8   7..0
>       +-------+-------+-------+-------+-------+-------+-------+-------+
>       | plain | zero  | attr  | res   | uacc  | lacc  | res   | char  |
>       +-------+-------+-------+-------+-------+-------+-------+-------+
>
> * Plain  Plain (5 bits)
> * Zero   Zero bits (3 bits)
> * Attr   Attributes (16 bits)

Q8: What kind of information are these three fields for?

Q9: In case your answer to Q8 is "they are application-defined", then what is the rationale for defining and naming more than one field? I mean: if they are application-defined, why not leave the task of defining sub-fields to the application?

> * Res    Reserved (8 bits)
> * Uacc   Upper Accent (8 bits)
> * Lacc   Lower Accent (8 bits)

Q10: Why do you treat "accents" specially? They are just characters like any others. In Unicode there is no special limitation as to how many "accents" can be applied to a base character. There is also no obligation for accents to have a base character.

> * Res    Reserved (8 bits)
> * Char   Character (8 bits)

Q11: How can you store a Latin character in 8 bits? Unicode has 938 Latin characters, and their codes range from U+0041 to U+FF5A.

> All of these fields are actually implementation defined, with just one
> rule for char: don't include characters that can be made with
> combinations, that's what the accent fields are for.

But characters are not necessarily decomposed into one "Latin character" with one "upper accent" and one "lower accent". E.g., U+01D5 (LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON) decomposes to U+0055 U+0308 U+0304 (LATIN CAPITAL LETTER U, COMBINING DIAERESIS, COMBINING MACRON). Both COMBINING DIAERESIS and COMBINING MACRON are "upper accents".

Q12: How are you going to deal with a combination of, e.g., a base letter + 5 "upper accents" + 3 "lower accents"?

> This allows for 255 upper and lower accents which should be enough -- for now.

I counted 129 "upper accents". But their codes range from U+0300 to U+1D1AD.
Q13: How are you going to compress these codes into 8 bits? Are you planning to use a conversion table from the Unicode code to your internal 8-bit code?

> For Han characters I thought of the following encoding method (with no
> particular plain in mind):
>
> bits   63..59  58..56  55..40  39..32  31..0
>       +-------+-------+-------+-------+------+
>       | plain | zero  | attr  | style | char |
>       +-------+-------+-------+-------+------+
>
> * Plain  Plain (5 bits)
> * Zero   Zero bits (3 bits)
> * Attr   Attributes (16 bits)
> * Style  Stylistic Variation (8 bits)

Q14: What kind of information is in field "Style"?

Q15:Why do only Han characters have
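To make Q11 and Q12 concrete, here is a hedged sketch that packs the proposed "plain 1" layout into a 64-bit integer and then decomposes U+01D5 with Python's `unicodedata`. The names `pack_latin` and `split_marks` are hypothetical, and treating canonical combining class 230 as "upper accent" and 220 as "lower accent" is an assumption made for this example; the proposal itself leaves the field contents implementation-defined.

```python
import unicodedata


def pack_latin(plane, attr, uacc, lacc, char):
    """Pack the fields of the proposed Latin layout into one 64-bit word.

    Bit positions follow the diagram in the proposal: plane 63..59,
    attr 55..40, uacc 31..24, lacc 23..16, char 7..0; zero and reserved
    fields are left as 0.
    """
    for value, bits in ((plane, 5), (attr, 16), (uacc, 8), (lacc, 8), (char, 8)):
        if not 0 <= value < (1 << bits):
            raise ValueError("field does not fit in its bit width")
    return (plane << 59) | (attr << 40) | (uacc << 24) | (lacc << 16) | char


def split_marks(ch):
    """Return (base, marks above, marks below) after canonical decomposition.

    Assumption: combining class 230 means "above" and 220 means "below".
    """
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    above = [m for m in marks if unicodedata.combining(m) == 230]
    below = [m for m in marks if unicodedata.combining(m) == 220]
    return base, above, below


base, above, below = split_marks("\u01D5")
print(base, [f"U+{ord(m):04X}" for m in above], below)

# The base letter fits the 8-bit char field, but there are two marks above
# and only one uacc slot, so some table of combinations would be needed.
word = pack_latin(plane=1, attr=0, uacc=0, lacc=0, char=ord(base))
print(hex(word))
```

Passing `char=0x01D5` directly raises `ValueError`, which is the 8-bit limit that Q11 asks about.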
Re: Internal Representation of Unicode
On 25/09/2003 20:52, [EMAIL PROTECTED] wrote:

> Hi,
>
> John Cowan writes:
>> The problem is that multiple accents above are quite common -- Vietnamese
>> depends on them heavily. There may also be multiple accents below,
>> for all I know.
>
> That does not have to be a problem, as long as there are no more than
> 255 accents and combinations of them. As for Vietnamese, I just don't
> know how many there are, or how many characters they use.
>
> Johann

In Hebrew there are more than 255 accents and combinations of them, if you count vowel points, dagesh, shin and sin dots as accents. There are potentially many thousands of combinations. From a quick search, something like 700-800 are in actual use in the Hebrew Bible.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
Re: Internal Representation of Unicode
. Jóhann Gunnar Óskarsson wrote, > That does not have to be a problem, as long as there are no more than > 255 accents and combinations of them. As for vietnamese, I just don't > know how many there are, or how many characters they use. The Combining Diacritical Marks range of Unicode 4.0 lists 107 combining marks which can be used in any combination. Some combining marks are supposed to span two base characters. Peter Constable (IIRC) reported on this list a while ago that there was a Latin-based writing system used for an indigenous South American language which stacks up to three marks above. Best regards, James Kass .
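The count James gives can be reproduced approximately with Python's `unicodedata`. This is a sketch under one stated assumption: the result depends on the Unicode version bundled with the interpreter, so it will not necessarily equal the 107 reported for Unicode 4.0. Marks that span two base characters are identified here by canonical combining classes 233 (double below) and 234 (double above).

```python
import unicodedata

# Combining Diacritical Marks block: U+0300..U+036F.
marks = [
    chr(cp)
    for cp in range(0x0300, 0x0370)
    if unicodedata.category(chr(cp)) == "Mn"
]

# Marks meant to span two base characters.
doubles = [m for m in marks if unicodedata.combining(m) in (233, 234)]

print("Unicode data version:", unicodedata.unidata_version)
print("combining marks in block:", len(marks))
print("double-width marks:", [f"U+{ord(m):04X}" for m in doubles])
```

Even within this one block the number of marks already exceeds what a naive one-accent-per-slot scheme can enumerate once combinations are counted.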
Re: Internal Representation of Unicode
Johann wrote: > That does not have to be a problem, as long as there are no more than > 255 accents and combinations of them. As for vietnamese, I just don't > know how many there are, or how many characters they use. You'll need UTF-8 and a fairly comprehensive font to read the following. For Vietnamese, you should count on supporting the following vowels: a à ả ã á ạ ă ằ ẳ ẵ ắ ặ â ầ ẩ ẫ ấ ậ e è ẻ ẽ é ẹ ê ề ể ễ ế ệ i ì ỉ ĩ í ị o ò ỏ õ ó ọ ô ồ ổ ỗ ố ộ ơ ờ ở ỡ ớ ợ u ù ủ ũ ú ụ ư ừ ử ữ ứ ự y ỳ ỷ ỹ ý ỵ the following consonant (in addition to most other English consonants): đ and this currency sign: ₫ For purposes of your mechanism, you can think of each vowel as having up to 2 accents: (upper, right-attached, or none) plus (upper, lower, or none). The way Vietnamese think of it is that the circumflex, breve, and horn are part of the base letter (making a total of 12 base vowels), whereas the grave, hook above, tilde, acute, and dot below are considered diacritics (6 × 12 = 72 total vowels). All combinations are possible. Of course, all of the letters (not the dong sign) come in both uppercase and lowercase. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
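Doug's arithmetic (12 base vowels × 6 tone choices = 72) can be checked by composing the letters with Python's `unicodedata`. A minimal sketch, assuming NFC normalization and the five tone marks he lists plus "no mark"; the names `BASE_VOWELS` and `TONE_MARKS` are made up for this example.

```python
import unicodedata

# The 12 base vowels: a, a-breve, a-circumflex, e, e-circumflex, i,
# o, o-circumflex, o-horn, u, u-horn, y (escapes avoid encoding trouble).
BASE_VOWELS = "a\u0103\u00E2e\u00EAio\u00F4\u01A1u\u01B0y"

# No mark, grave, hook above, tilde, acute, dot below.
TONE_MARKS = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]

vowels = [
    unicodedata.normalize("NFC", base + tone)
    for base in BASE_VOWELS
    for tone in TONE_MARKS
]

print(len(vowels), "lowercase vowels")
print("all precomposed:", all(len(v) == 1 for v in vowels))

# After full decomposition, count the combining marks on each vowel.
most_marks = max(len(unicodedata.normalize("NFD", v)) - 1 for v in vowels)
print("most marks on one vowel:", most_marks)
```

Every combination has a precomposed code point, yet fully decomposed some vowels carry two marks, which is the case a single upper-accent slot plus a single lower-accent slot has to handle.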
Re: Internal Representation of Unicode
Hi,

John Cowan writes:
> The problem is that multiple accents above are quite common -- Vietnamese
> depends on them heavily. There may also be multiple accents below,
> for all I know.

That does not have to be a problem, as long as there are no more than 255 accents and combinations of them. As for Vietnamese, I just don't know how many there are, or how many characters they use.

Johann

--
Emacs is not a text editor -- it's a way of life
Re: Internal Representation of Unicode
[EMAIL PROTECTED] scripsit:

> All of these fields are actually implementation defined, with just one
> rule for char: don't include characters that can be made with
> combinations, that's what the accent fields are for. This allows for
> 255 upper and lower accents which should be enough -- for now.

The problem is that multiple accents above are quite common -- Vietnamese depends on them heavily. There may also be multiple accents below, for all I know.

--
John Cowan  http://www.ccil.org/~cowan  [EMAIL PROTECTED]
Be yourself. Especially do not feign a working knowledge of RDF where no such knowledge exists. Neither be cynical about RELAX NG; for in the face of all aridity and disenchantment in the world of markup, James Clark is as perennial as the grass. --DeXiderata, Sean McGrath