Question about U+170D, which I hope will become TAGALOG LETTER RA
Greetings. I write this letter with questions regarding a proposal I hope to make for the encoding of TAGALOG LETTER RA, which we locally know as the baybayin letter "ra", at U+170D. Many fonts already use this unencoded code point for TAGALOG LETTER RA in breach of the standard. TAGALOG LETTER RA looks like TAGALOG LETTER DA, U+1707, with an extra stroke. For examples, see Norman de los Santos' Unicode baybayin fonts.[2] Paul Morrow's fonts, which are used on the Philippine peso, also include "ra", apart from the ones meant to be exact digitizations of the first baybayin fonts.[4]

I had previously assumed that this space had been left open in anticipation of a future encoding of TAGALOG LETTER RA, and that this simply hadn't happened due to apathy; however, I've since been informed that the space was left open as something of an oversight, given that four Philippine scripts were encoded at once as a result of WG2 proposal N1933.[1] I hope to make this request because the Google Noto developers will not follow the de facto standard unless it is given the Consortium's approval.[3]

My questions are:

• How old do I need to prove the letter is? Baybayin "ra" is not used in writing Old Tagalog and is not used in the earliest Tagalog texts. However, it has certainly existed since at least 1985,[4; under heading Bikol Mintz] and perhaps decades earlier.

• May I use signs and fonts as evidence? What types of documents may I use?

• Would anyone volunteer to help me write this proposal, or check it over before I send it?

Thank you.

[1]: https://www.unicode.org/L2/L1999/n1933.pdf
[2]: http://nordenx.blogspot.com/p/downloads.html
[3]: https://github.com/googlefonts/noto-fonts/issues/1185
[4]: http://paulmorrow.ca/fonts.htm
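As an aside (not part of the original message): whether a given code point such as U+170D is assigned can be checked with Python's bundled `unicodedata` module. This is a minimal sketch; note that the answer depends on which Unicode Character Database version ships with your interpreter, so U+170D's status will vary by Python version.

```python
import unicodedata

def is_assigned(cp: int) -> bool:
    """Return True if the code point has a character name in this
    Python build's copy of the Unicode Character Database."""
    try:
        unicodedata.name(chr(cp))
        return True
    except ValueError:
        return False

# U+1707 TAGALOG LETTER DA has been encoded since Unicode 3.2:
print(is_assigned(0x1707))  # True
# U+170D was an unassigned gap at the time of this message; whether it
# reports as assigned here depends on the bundled UCD version:
print(hex(0x170D), is_assigned(0x170D))
```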
Re: Update to the second question summary (was: A sign/abbreviation for "magister")
> On 2 Dec 2018, at 20:29, Janusz S. Bień via Unicode wrote:
>
> On Sun, Dec 02 2018 at 10:33 +0100, Hans Åberg via Unicode wrote:
>>
>> It was common in the 1800s to singly and doubly underline superscript
>> abbreviations in handwriting according to [1-2], and [2] also mentions
>> the abbreviation discussed in this thread.
>
> Thank you very much for this reference to the very abbreviation! I
> looked up Wikipedia but I haven't read it carefully enough :-(

Quite a coincidence, as I was looking at the article topic, and it happened to have this remark embedded!

>> 1. https://en.wikipedia.org/wiki/Ordinal_indicator
>> 2. https://en.wikipedia.org/wiki/Ordinal_indicator#cite_note-1
Update to the second question summary (was: A sign/abbreviation for "magister")
On Sun, Dec 02 2018 at 10:33 +0100, Hans Åberg via Unicode wrote:

>> On 30 Oct 2018, at 22:50, Ken Whistler via Unicode wrote:
>>
>> On 10/30/2018 2:32 PM, James Kass via Unicode wrote:
>>> but we can't seem to agree on how to encode its abbreviation.
>>
>> For what it's worth, "mgr" seems to be the usual abbreviation in Polish
>> for it.
>
> It was common in the 1800s to singly and doubly underline superscript
> abbreviations in handwriting according to [1-2], and [2] also mentions
> the abbreviation discussed in this thread.

Thank you very much for this reference to the very abbreviation! I looked up Wikipedia but I haven't read it carefully enough :-(

> 1. https://en.wikipedia.org/wiki/Ordinal_indicator
> 2. https://en.wikipedia.org/wiki/Ordinal_indicator#cite_note-1

Best regards

Janusz

--
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien
Preformatted superscript in ordinary text, paleography and phonetics using Latin script (was: Re: A sign/abbreviation for "magister" - third question summary)
On 06/11/2018 12:04, Janusz S. Bień via Unicode wrote:

On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:

Hi! On the over 100 years old postcard https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6 you can see 2 occurrences of a symbol which is explicitly explained (in Polish) as meaning "Magister". [...] The third and the last question is: how to encode this symbol in Unicode?

A constructive answer to my question was provided quickly by James Kass:

On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:

Mr͇ / M=ͬ

I answered:

On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bień via Unicode wrote:

[...] For me only the latter seems acceptable. Using COMBINING LATIN SMALL LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as the base character. However, in the lack of a better solution I can live with it :-) An alternative would be to use SMALL EQUALS SIGN, but it looks like fonts supporting it are rather rare.

and Philippe Verdy commented:

On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote:

[...] There's a third alternative, that uses the superscript letter r, followed by the combining double underline, instead of the normal letter r followed by the same combining double underline.

Some comments were made also by Michael Everson:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...] I would encode this as Mʳ if you wanted to make sure your data contained the abbreviation mark. It would not make sense to encode it as M=ͬ or anything else like that, because the “r” is not modifying a dot or a squiggle or an equals sign. The dot or squiggle or equals sign has no meaning at all. And I would not encode it as Mr͇, firstly because it would never render properly and you might as well encode it as Mr. or M:r, and second because in the IPA at least that character indicates an alveolar realization in disordered speech. (Of course it could be used for anything.)
FYI, I decided to use the encoding proposed by Philippe Verdy (if I understand him correctly): Mʳ̳ i.e.

'LATIN CAPITAL LETTER M' (U+004D)
'MODIFIER LETTER SMALL R' (U+02B3)
'COMBINING DOUBLE LOW LINE' (U+0333)

for purely pragmatic reasons: it is rendered quite well in my Emacs. According to the 'fc-search-codepoint' script, the sequence is supported on my computer by almost 150 fonts, so I hope to find in due time a way to render it correctly also in XeTeX. I'm also going to add it to my private named sequences list (https://bitbucket.org/jsbien/unicode4polish).

The same post contained a statement which I don't accept:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...] The squiggle in your sample, Janusz, does not indicate anything; it is only a decoration, and the abbreviation is the same without it.

One of the reasons I disagree was described by me in the separate thread "use vs mention": https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html

There were also some other statements which I find unacceptable:

On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:

[...] The abbreviation in the postcard, rendered in plain text, is "Mr".

He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at 9:38 GMT (and earlier in a private mail). I understand that both of them by "plain text" mean Unicode.

On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:

You could use the various hacks you've discussed, with modifier letters; but that is not "encoding", that is "abusing Unicode to do markup". At least, that's the view I take!

and was supported by Asmus Freytag on Wed, Oct 31 2018 at 3:12 -0700. The latter elaborated his view later and I answered:

On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bień via Unicode wrote:

On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote: [...]
All else is just applying visual hacks

I don't mind hacks if they are useful and serve the intended purpose, even if they are visual :-) [...]

at the possible cost of obscuring the contents.

It's for the users of the transcription to decide what is obscuring the text and what, to the contrary, makes the transcription more readable and useful.

Please note that it's me who makes the transcription, it's me who has a vision of the future use and users, and in consequence it's me who makes the decision which aspects of text to encode. Accusing me of "abusing Unicode" will not stop me from doing it my way. I hope that at least James Kass understands my attitude:

On Mon, Oct 29 2018 at 7:57 GMT, James Kass via Unicode wrote:

[...] If I were entering plain text data from an old post card, I'd try to keep the data as close to the source as possible. Because that would be my purpose. Others might have different purposes.

There were also presented some ideas which I would call "futuristic": in
A sign/abbreviation for "magister" - third question summary
On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:

> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurrences of a symbol which is explicitly explained (in
> Polish) as meaning "Magister".
> [...]
> The third and the last question is: how to encode this symbol in
> Unicode?

A constructive answer to my question was provided quickly by James Kass:

On Sat, Oct 27 2018 at 19:52 GMT, James Kass via Unicode wrote:

> Mr͇ / M=ͬ

I answered:

On Sun, Oct 28 2018 at 18:28 +0100, Janusz S. Bień via Unicode wrote:

[...]
> For me only the latter seems acceptable. Using COMBINING LATIN SMALL
> LETTER R is a natural idea, but I feel uneasy using just EQUALS SIGN as
> the base character. However, in the lack of a better solution I can live
> with it :-)
>
> An alternative would be to use SMALL EQUALS SIGN, but it looks like
> fonts supporting it are rather rare.

and Philippe Verdy commented:

On Sun, Oct 28 2018 at 18:54 +0100, Philippe Verdy via Unicode wrote:

[...]
> There's a third alternative, that uses the superscript letter r,
> followed by the combining double underline, instead of the normal
> letter r followed by the same combining double underline.

Some comments were made also by Michael Everson:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...]
> I would encode this as Mʳ if you wanted to make sure your data
> contained the abbreviation mark. It would not make sense to encode it
> as M=ͬ or anything else like that, because the “r” is not modifying a
> dot or a squiggle or an equals sign. The dot or squiggle or equals
> sign has no meaning at all. And I would not encode it as Mr͇, firstly
> because it would never render properly and you might as well encode it
> as Mr. or M:r, and second because in the IPA at least that character
> indicates an alveolar realization in disordered speech. (Of course it
> could be used for anything.)
FYI, I decided to use the encoding proposed by Philippe Verdy (if I understand him correctly): Mʳ̳ i.e.

'LATIN CAPITAL LETTER M' (U+004D)
'MODIFIER LETTER SMALL R' (U+02B3)
'COMBINING DOUBLE LOW LINE' (U+0333)

for purely pragmatic reasons: it is rendered quite well in my Emacs. According to the 'fc-search-codepoint' script, the sequence is supported on my computer by almost 150 fonts, so I hope to find in due time a way to render it correctly also in XeTeX. I'm also going to add it to my private named sequences list (https://bitbucket.org/jsbien/unicode4polish).

The same post contained a statement which I don't accept:

On Sun, Oct 28 2018 at 20:42 GMT, Michael Everson via Unicode wrote:

[...]
> The squiggle in your sample, Janusz, does not indicate anything; it is
> only a decoration, and the abbreviation is the same without it.

One of the reasons I disagree was described by me in the separate thread "use vs mention": https://unicode.org/mail-arch/unicode-ml/y2018-m10/0133.html

There were also some other statements which I find unacceptable:

On Mon, Oct 29 2018 at 12:20 -0700, Doug Ewell via Unicode wrote:

[...]
> The abbreviation in the postcard, rendered in plain text, is "Mr".

He was supported by Julian Bradfield in his mail on Wed, Oct 31 2018 at 9:38 GMT (and earlier in a private mail). I understand that both of them by "plain text" mean Unicode.

On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:

> You could use the various hacks you've discussed, with modifier
> letters; but that is not "encoding", that is "abusing Unicode to do
> markup". At least, that's the view I take!

and was supported by Asmus Freytag on Wed, Oct 31 2018 at 3:12 -0700. The latter elaborated his view later and I answered:

On Fri, Nov 02 2018 at 17:20 +0100, Janusz S. Bień via Unicode wrote:

> On Fri, Nov 02 2018 at 5:09 -0700, Asmus Freytag via Unicode wrote:
[...]
>> All else is just applying visual hacks
>
> I don't mind hacks if they are useful and serve the intended purpose,
> even if they are visual :-)

[...]

>> at the possible cost of obscuring the contents.
>
> It's for the users of the transcription to decide what is obscuring the
> text and what, to the contrary, makes the transcription more readable
> and useful.

Please note that it's me who makes the transcription, it's me who has a vision of the future use and users, and in consequence it's me who makes the decision which aspects of text to encode. Accusing me of "abusing Unicode" will not stop me from doing it my way. I hope that at least James Kass understands my attitude:

On Mon, Oct 29 2018 at 7:57 GMT, James Kass via Unicode wrote:

[...]
> If I were entering plain text data from an old post card, I'd try to
> keep the data as close to the source as possible. Because that would
> be my purpose. Others might have different purposes.
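As an editorial aside (not part of the thread): the sequence Janusz settled on can be inspected with Python's `unicodedata` module. This minimal sketch lists the code points by name and checks that the sequence is stable under canonical normalization, so it round-trips safely through NFC/NFD processing (NFKC/NFKD would fold the modifier letter to a plain "r").

```python
import unicodedata

# The encoding chosen above: M + modifier letter small r + double low line.
seq = "M\u02B3\u0333"
names = [unicodedata.name(c) for c in seq]
print(names)
# ['LATIN CAPITAL LETTER M', 'MODIFIER LETTER SMALL R',
#  'COMBINING DOUBLE LOW LINE']

# Canonical normalization leaves the sequence unchanged:
assert unicodedata.normalize("NFC", seq) == seq
assert unicodedata.normalize("NFD", seq) == seq
```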
A sign/abbreviation for "magister" - second question summary
On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:

> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurrences of a symbol which is explicitly explained (in
> Polish) as meaning "Magister".

[...]

> The second question is: are you familiar with such or a similar symbol?
> Have you ever seen it in print?

Later I provided some additional information:

On Sat, Oct 27 2018 at 16:09 +0200, Janusz S. Bień via Unicode wrote:

> The postcard is from the front of the First World War, written by an
> Austro-Hungarian soldier. He explains the meaning of the abbreviation
> to his wife, so it looks like the abbreviation was used but not very
> popular.

On Sat, Oct 27 2018 at 20:25 +0200, Janusz S. Bień via Unicode wrote:

[...]
> In the meantime I looked up some other postcards written by the same
> person and found several other abbreviations, including № 'NUMERO SIGN'
> (U+2116) written in the same way, i.e. with a double instead of a single
> line.

The similarity to № 'NUMERO SIGN' was mentioned quite often in the thread; there seems to be no need to quote all these mentions here.

A more general observation was formulated by Richard Wordingham:

On Sun, Oct 28 2018 at 8:13 GMT, Richard Wordingham via Unicode wrote:

[...]
> The notation is a quite widespread format for abbreviations: the
> first letter is normal sized, and the subsequent letter is written in
> some variety of superscript with a squiggle underneath so that it
> doesn't get overlooked.

Various examples of such abbreviations were also mentioned several times in the thread, but again there seems to be no need to quote all these mentions here. Nobody, however, reported any other occurrence of the symbol in question.

Best regards

Janusz

--
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien
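A side note that may make the № comparison concrete (my addition, not from the thread): NUMERO SIGN is a compatibility character whose NFKC form is plain "No", which is one reason the standard is reluctant to add further precomposed abbreviation signs of this kind. This can be verified with Python's `unicodedata`:

```python
import unicodedata

numero = "\u2116"  # № NUMERO SIGN
print(unicodedata.name(numero))               # NUMERO SIGN
# Compatibility normalization folds it to the two plain letters:
print(unicodedata.normalize("NFKC", numero))  # No
```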
A sign/abbreviation for "magister" - first question summary
On Sat, Oct 27 2018 at 14:10 +0200, Janusz S. Bień via Unicode wrote:

> Hi!
>
> On the over 100 years old postcard
>
> https://photos.app.goo.gl/GbwNwYbEQMjZaFgE6
>
> you can see 2 occurrences of a symbol which is explicitly explained (in
> Polish) as meaning "Magister".
>
> First question is: how do you interpret the symbol? For me it is
> definitely the capital M followed by the superscript "r" (written in an
> old style no longer used in Poland), but there is something below the
> superscript. It looks like a small "z", but such an interpretation
> doesn't make sense for me.

I got almost immediately two complementary answers:

On Sat, Oct 27 2018 at 9:11 -0400, Robert Wheelock wrote:

> It is constructed much like the symbol for numero—only with a capital
> accompanied by a superscript small having an underbar (or double
> underbar).

On Sat, Oct 27 2018 at 6:58 -0700, Asmus Freytag via Unicode wrote:

[...]
> My suspicion would be that the small "z" is rather a "=" that
> acquired a connecting stroke as part of quick handwriting.

A./

and on the same day this interpretation was supported by Philippe Verdy:

On Sat, Oct 27 2018 at 20:35 +0200, Philippe Verdy via Unicode wrote:

[...]
> I have the same kind of reading: the zigzagging stroke is a
> handwritten emphasis of the superscript r above it (explicitly noting
> it is terminating the abbreviation), just like the small underline that
> happens sometimes below the superscript o in the abbreviation of
> "numero" (as well, sometimes there was not just one but two small
> underlines, including in some prints).
>
> This sample is a perfect example of fast cursive handwriting (due to
> high variability of all other letter shapes, sizes and joinings, where
> even the capital M is written as two unconnected strokes), and it's
> not abnormal to see in such conditions this cursive joining between the
> two underlining strokes so that it looks like a single zigzag.
Later it was summarized by James Kass:

On Fri, Nov 02 2018 at 2:59 GMT, James Kass via Unicode wrote:

> Alphabetic script users write things the way they are spelled and
> spell things the way they are written. The abbreviation in question
> as written consists of three recognizable symbols. An "M", a
> superscript "r", and an equal sign (= two lines). It can be printed,
> handwritten, or in fraktur; it will still consist of those same three
> recognizable symbols.
>
> We're supposed to be preserving the past, not editing it or revising
> it.

It was commented on by Julian Bradfield:

On Fri, Nov 02 2018 at 8:54 GMT, Julian Bradfield via Unicode wrote:

[...]
> That's not true. The squiggle under the r is a squiggle - it is a
> matter of interpretation (on which there was some discussion a hundred
> messages up-thread or so :) whether it was intended to be = .
> Just as it is a matter of interpretation whether the superscript and
> squiggle were deeply meaningful to the writer, or whether they were
> just a stylistic flourish for Mr.

The abbreviation in question definitely consists of three symbols: an "M", a superscript "r" and a third one, which I think was best described by Robert Wheelock as a double (under)bar, with the connecting stroke first mentioned by Asmus Freytag. This third element was referred to, also by myself, as a squiggle, but after looking up the definition of the word in a dictionary ("a short line that has been written or drawn and that curves and twists in a way that is not regular") I think this is a misnomer. Unfortunately I have no better proposal.

Best regards

Janusz

--
Janusz S. Bien
emeryt (emeritus)
https://sites.google.com/view/jsbien
Re: Shortcuts question
Note: CLDR concentrates on keyboard layouts for text input. Layouts for other functions (such as copy-pasting or gaming controls) are completely different, and not necessarily bound directly to layouts for text, as they may also have their own dedicated physical keys, or users can reprogram their keyboard for this. For gaming, software should always offer a way to customize the layout according to users' needs, and should provide reasonable defaults for at least the 3 base layouts: QWERTY, AZERTY and QWERTZ. But I've never seen any game whose UI was tuned for Dvorak.

On Mon, 17 Sep 2018 at 16:42, Marcel Schneider wrote:

> On 17/09/18 05:38 Martin J. Dürst wrote:
> [quote]
>
> > From my personal experience: A few years ago, installing a Dvorak
> > keyboard (which is what I use every day for typing) didn't remap the
> > control keys, so that Ctrl-C was still on the bottom row of the left
> > hand, and so on. For me, it was really terrible.
> >
> > It may not be the same for everybody, but my experience suggests that
> > it may be similar for some others, and that therefore such a mapping
> > should only be voluntary, not default.
>
> Got it, thanks!
>
> Regards,
>
> Marcel
Re: Shortcuts question
On 17/09/18 05:38 Martin J. Dürst wrote: [quote] > > From my personal experience: A few years ago, installing a Dvorak > keyboard (which is what I use every day for typing) didn't remap the > control keys, so that Ctrl-C was still on the bottom row of the left > hand, and so on. For me, it was really terrible. > > It may not be the same for everybody, but my experience suggests that it > may be similar for some others, and that therefore such a mapping should > only be voluntary, not default. Got it, thanks! Regards, Marcel
Re: Shortcuts question
On 2018/09/16 21:08, Marcel Schneider via Unicode wrote:

> An additional level of complexity is induced by ergonomics, so that most
> non-Latin layouts may wish to stick with QWERTY, and even ergonomic
> layouts in the footsteps of August Dvorak rather than Shai Coleman are
> likely to offer variants with legacy Virtual Key mapping instead of
> staying in congruency with graphics optimized for text input.

From my personal experience: A few years ago, installing a Dvorak keyboard (which is what I use every day for typing) didn't remap the control keys, so that Ctrl-C was still on the bottom row of the left hand, and so on. For me, it was really terrible.

It may not be the same for everybody, but my experience suggests that it may be similar for some others, and that therefore such a mapping should only be voluntary, not default.

Regards, Martin.
Re: Shortcuts question
For games, the mnemonic meaning of keys is unlikely to be used, because gamers prefer an ergonomic placement of their fingers according to the physical position of essential commands. But this won't apply to control keys, as these commands should be single keystrokes: pressing two keys instead of one would be impractical and a disadvantage when playing. That's why the four most common direction keys, A/D/S/W on a QWERTY layout, become Q/D/S/Z on a French AZERTY layout. Games that use logical key layouts based on QWERTY are almost unplayable if there's no interface to customize these 4 keys.

So games preferably use the virtual keys for these commands instead, or include built-in layouts adapted to AZERTY- and QWERTZ-based layouts and still display the correct keycaps in the UI. Games normally don't force a switch to another US layout, so they still need to use the logical layout, simply because they also need to allow users to input real text and not just gaming commands (for messaging, for naming custom players/objects created in the game itself, for filling in user profiles, or for entering a registration email or performing online logon with the correct password). In that case they will also need to support characters entered with control keys (AltGr, Shift, Control...), or with a standard tactile panel on screen which will still display the common localized layouts.

There are difficulties in games when some of their commands are mapped to something other than basic Latin letters, including decimal digits: on a French AZERTY keyboard, the digits are composed by pressing Shift, or in ShiftLock mode. There's no CapsLock mode, as this ShiftLock is also released when pressing Shift; just like on old French mechanical typewriters, pressing ShiftLock again did not release it, and this ShiftLock applied to all keys on the keyboard, including punctuation keys.
On PC keyboards, ShiftLock does not apply to the numeric pad, which has its separate NumLock, now largely redundant; most users would like to disable it completely whenever the numeric pad is separate from the directional pad. On these extended keyboards, NumLock is just a nuisance, notably on the OS logon screen, where Windows turns it off by default unless the BIOS locks it on at boot time, and a lot of BIOSes don't do that or don't have an option to set it permanently.

On Sun, 16 Sep 2018 at 14:18, Marcel Schneider via Unicode <unicode@unicode.org> wrote:

> On 15/09/18 15:36, Philippe Verdy wrote:
> […]
> > So yes, all control keys are potentially localisable to work best with
> > the base layout and remain mnemonic; but the physical key position may
> > be very different.
>
> An additional level of complexity is induced by ergonomics, so that most
> non-Latin layouts may wish to stick with QWERTY, and even ergonomic
> layouts in the footsteps of August Dvorak rather than Shai Coleman are
> likely to offer variants with legacy Virtual Key mapping instead of
> staying in congruency with graphics optimized for text input. But again,
> that is easier on Windows, where VKs are remapped separately, than on
> Linux, which appears to use graphics throughout to process application
> shortcuts; only modifiers can be "preserved" for further processing, with
> no underlying letter map, which AFAIU does not exist on Linux.
>
> However, about keyboarding, this may be technically too detailed for this
> list, so I'll step out of this thread here. Please follow up in the
> parallel thread on CLDR-users instead.
>
> https://unicode.org/pipermail/cldr-users/2018-September/000837.html
>
> Thanks,
>
> Marcel
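The position-versus-letter distinction running through this subthread can be sketched with two hypothetical lookup tables mapping physical key positions to the characters they produce (the `KeyW`-style position names follow the W3C `KeyboardEvent.code` convention; the tables themselves are illustrative, not from any real API):

```python
# Hypothetical per-layout tables: physical key position -> character produced.
QWERTY = {"KeyW": "w", "KeyA": "a", "KeyS": "s", "KeyD": "d",
          "KeyQ": "q", "KeyZ": "z"}
AZERTY = {"KeyW": "z", "KeyA": "q", "KeyS": "s", "KeyD": "d",
          "KeyQ": "a", "KeyZ": "w"}

def keys_for_letters(layout, letters):
    """Find which physical keys produce the given letters on a layout."""
    inverse = {ch: pos for pos, ch in layout.items()}
    return [inverse[ch] for ch in letters]

# Binding to the *letters* W/A/S/D scatters the bindings across the
# physical keyboard on AZERTY (the Z, Q, S and D positions):
print(keys_for_letters(AZERTY, "wasd"))  # ['KeyZ', 'KeyQ', 'KeyS', 'KeyD']

# Binding to the *positions* KeyW/KeyA/KeyS/KeyD keeps the same finger
# placement; on AZERTY those keys produce Z/Q/S/D, as noted above:
print([AZERTY[p] for p in ("KeyW", "KeyA", "KeyS", "KeyD")])  # ['z', 'q', 's', 'd']
```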
Re: Shortcuts question
On 15/09/18 15:36, Philippe Verdy wrote:
[…]
> So yes, all control keys are potentially localisable to work best with
> the base layout and remain mnemonic; but the physical key position may
> be very different.

An additional level of complexity is induced by ergonomics, so that most non-Latin layouts may wish to stick with QWERTY, and even ergonomic layouts in the footsteps of August Dvorak rather than Shai Coleman are likely to offer variants with legacy Virtual Key mapping instead of staying in congruency with graphics optimized for text input. But again, that is easier on Windows, where VKs are remapped separately, than on Linux, which appears to use graphics throughout to process application shortcuts; only modifiers can be "preserved" for further processing, with no underlying letter map, which AFAIU does not exist on Linux.

However, about keyboarding, this may be technically too detailed for this List, so I’ll step out of this thread here. Please follow up in the parallel thread on CLDR-users instead.

https://unicode.org/pipermail/cldr-users/2018-September/000837.html

Thanks,

Marcel
Re: Shortcuts question
On Fri, 7 Sep 2018 at 05:43, Marcel Schneider via Unicode <unicode@unicode.org> wrote:

> On 07/09/18 02:32 Shriramana Sharma via Unicode wrote:
> >
> > Hello. This may be slightly OT for this list but I'm asking it here as
> > it concerns computer usage with multiple scripts and i18n:
>
> It actually belongs on CLDR-users list. But coming from you, it shall
> remain here while I’m posting a quick answer below.
>
> > 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
> > "tout" io Ctrl+A for "all"?
>
> No, Ctrl+A remains Ctrl+A on a French keyboard.

Yes, but its location on the keyboard maps to the same position as CTRL+Q on a QWERTY layout: CTRL+ASCII-letter shortcuts are mapped according to the layout of the letter (without pressing CTRL) on the localized keyboard. Some keyboard layouts don't have all the basic Latin letters because their language doesn't need them (e.g. a layout may have only one of Q or K, no C, or no W, or some letters may carry combined diacritics or be ligatures), but usually the basic Latin letter is still accessible by pressing another control key or by switching the layout mode.

On non-Latin keyboard layouts there's much more freedom, and CTRL+A may be localized according to the main base letter assigned to the key (the position of the Latin letter is not always visible). On tactile layouts you cannot guess where CTRL+Latin-letter is located; it may actually be accessible very differently, on a separate layout for controls, where the letters will be translated: the CTRL key is not necessarily present, usually replaced by a single key for input mode selection (which may switch languages, or to emoji, or to symbols/punctuation/digits)... The problematic control keys are those like "CTRL+[" (assuming ASCII as the base layout) where "[" is not present or mapped very differently. As well, "CTRL+1"..."CTRL+0" may conflict with the assignment of ASCII controls like "CTRL+[".
So yes, all control keys are potentially localisable to work best with the base layout and remain mnemonic; but the physical key position may be very different.
Re: Shortcuts question (is: Thread transfer info)
Hello, I’ve followed up on CLDR-users:

https://unicode.org/pipermail/cldr-users/2018-September/000837.html

As a sidenote: it might be hard to get a selection of discussions to actually happen on CLDR-users instead of the Unicode Public mail list, as long as subscribers of this list don’t necessarily subscribe to the other list too, which still has far fewer subscribers than Unicode Public.

Regards,

Marcel
Re: Shortcuts question
Shriramana Sharma:

> 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
> "tout" io Ctrl+A for "all"?

Some are, many are not. For instance, some text editors use a modifier key with F and K instead of B and I for bold ("fett") and italic ("kursiv").

> 2) How about when the shortcuts are the Alt+ combinations referring to
> underlined letters in actual user visible strings?

Those are much more language-dependent than Ctrl/Cmd shortcuts.

> 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt
> the other XCV shortcuts) Z key or the Y key which is in the physical
> position of the QWERTY Z key (and close to the other XCV shortcuts)?

For some shortcuts the key position is more important (e.g. the one left of the 1 key); for others it's the initial / conventional letter of the command. Most QWERTZ users are not used to expecting the undo shortcut (Z) next to the keys for cut (X), copy (C) and paste (V). By the way, the accompanying redo is notoriously inconsistent: sometimes Y, sometimes Shift+Z.

More serious problems arise with non-letter keys. For instance, square brackets [ and ] are readily available on the US / English keyboard layout, but require modifier keys like Shift or Alt on many other keyboard layouts, which may be the same ones as for the curly braces { and }. This means some seemingly simple and intuitive shortcuts on an English keyboard become cumbersome on international ones.
Re: Shortcuts question
On 07/09/18 02:32 Shriramana Sharma via Unicode wrote:
>
> Hello. This may be slightly OT for this list but I'm asking it here as it
> concerns computer usage with multiple scripts and i18n:

It actually belongs on the CLDR-users list. But coming from you, it shall remain here while I’m posting a quick answer below.

> 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for
> "tout" io Ctrl+A for "all"?

No, Ctrl+A remains Ctrl+A on a French keyboard.

> 2) How about when the shortcuts are the Alt+ combinations referring to
> underlined letters in actual user visible strings?

I don’t know, but the accelerator shortcuts usually process text input, so it would be up to the vendor to keep them in sync.

> 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt
> the other XCV shortcuts) Z key or the Y key which is in the physical
> position of the QWERTY Z key (and close to the other XCV shortcuts)?

On Windows, which this question refers to, virtual keys move around with the graphics on Latin keyboards. While Ctrl+Z on QWERTZ is not handy, I can tell that it is Ctrl+Z on AZERTY, with the key bearing the Z on it and typing "z". The latter is most relevant on Linux, where graphics are used even to process the Ctrl+ shortcuts.

> 4) How are shortcuts handled in the case of non Latin keyboards like
> Cyrillic or Japanese?

On Windows, as they depend on Virtual Keys, they may be laid out on an underlying QWERTY basis. The same may apply on macOS, where distinct levels are present in the XML keylayout (and likewise in system-shipped layouts) to map the letters associated with shortcuts, regardless of the script. On Linux, shortcuts are reported not to work on some non-Latin keyboard layouts (because key names are based on ISO key positions, and XKB doesn’t appear to use a "Group0" level to map the shortcut letters; this needs to be investigated).

> 4a) I mean how are they displayed on screen?
My short answer is: I've got no experience; maybe using Latin letters and locale labels. > 4b) Like #1 above, are they changed per language? Non-Latin scripts typically use QWERTY for ASCII input, so shortcuts may not be changed per language. > 4c) Like #2 above, how about for user visible shortcuts? Again I'm leaving this to non-Latin script experts. > (In India since English is an associate official language, most computer > users are at least conversant with basic English > so we use the English/QWERTY shortcuts even if the keyboard physically shows > an Indic script.) The same applies to virtually any non-Latin locale. Michael Kaplan reported that VKs move around only on Latin keyboards. > Thanks! You are welcome. Marcel
Shortcuts question
Hello. This may be slightly OT for this list but I'm asking it here as it concerns computer usage with multiple scripts and i18n: 1) Are shortcuts like Ctrl+C changed as per locale? I mean Ctrl+T for "tout" io Ctrl+A for "all"? 2) How about when the shortcuts are the Alt+ combinations referring to underlined letters in actual user visible strings? 3) In a QWERTZ layout for Undo should one still press the (dislocated wrt the other XCV shortcuts) Z key or the Y key which is in the physical position of the QWERTY Z key (and close to the other XCV shortcuts)? 4) How are shortcuts handled in the case of non Latin keyboards like Cyrillic or Japanese? 4a) I mean how are they displayed on screen? 4b) Like #1 above, are they changed per language? 4c) Like #2 above, how about for user visible shortcuts? (In India since English is an associate official language, most computer users are at least conversant with basic English so we use the English/QWERTY shortcuts even if the keyboard physically shows an Indic script.) Thanks!
Re: Question about Karabakh Characters
It is legitimate to add characters for Armenian dialectology, and if you can provide additional evidence of usage in lexicography and (if possible) in other literature, we can see if a proposal can be made. We may do this offline so as to save the list from too many files. I look forward to hearing from you. Nothing will happen, though, without further information. Michael > On 5 Oct 2017, at 06:09, via Unicode <unicode@unicode.org> wrote: > > Thank you for your reply. > I am currently handling technical support to publish in multiple languages. > > This was found when we were handling a project on the Karabakh language. > I was informed that Karabakh has a dictionary containing over 40,000 words > that was produced in 2013 which employs the three characters. > I personally have not seen this dictionary, but it seems that there are ones > that need these characters. > So I decided to make a post. > > Kazunari Tsuboi > > -Original Message- > From: Michael Everson [mailto:ever...@evertype.com] > Sent: Wednesday, October 4, 2017 11:31 PM > To: Tsuboi, Kazunari > Cc: unicode Unicode Discussion > Subject: Re: Question about Karabakh Characters > > They are not encoded, but that example is not sufficient. If you'd like to > contact me offline we can discuss this further. > > Michael Everson > >> On 4 Oct 2017, at 08:39, via Unicode <unicode@unicode.org> wrote: >> >> Hi there, >> >> The Karabakh language uses Armenian characters, but the following >> characters do not have a Unicode code point assigned. (image1.JPG attached) They >> are pronounced "Yi", "Ini" and "Eh" and used in several >> combinations. (Image2.JPG attached) >> >> Is there any reason these characters are not supported by Unicode? >> I would appreciate any related information. >> >> Thank you! >> >> Kazunari Tsuboi >> > >
RE: Question about Karabakh Characters
Thank you for your reply. I am currently handling technical support to publish in multiple languages. This was found when we were handling a project on the Karabakh language. I was informed that Karabakh has a dictionary containing over 40,000 words that was produced in 2013 which employs the three characters. I personally have not seen this dictionary, but it seems that there are ones that need these characters. So I decided to make a post. Kazunari Tsuboi -Original Message- From: Michael Everson [mailto:ever...@evertype.com] Sent: Wednesday, October 4, 2017 11:31 PM To: Tsuboi, Kazunari Cc: unicode Unicode Discussion Subject: Re: Question about Karabakh Characters They are not encoded, but that example is not sufficient. If you'd like to contact me offline we can discuss this further. Michael Everson > On 4 Oct 2017, at 08:39, via Unicode <unicode@unicode.org> wrote: > > Hi there, > > The Karabakh language uses Armenian characters, but the following > characters do not have a Unicode code point assigned. (image1.JPG attached) They > are pronounced "Yi", "Ini" and "Eh" and used in several > combinations. (Image2.JPG attached) > > Is there any reason these characters are not supported by Unicode? > I would appreciate any related information. > > Thank you! > > Kazunari Tsuboi >
Re: Question about Karabakh Characters
They are not encoded, but that example is not sufficient. If you'd like to contact me offline we can discuss this further. Michael Everson > On 4 Oct 2017, at 08:39, via Unicode wrote: > > Hi there, > > The Karabakh language uses Armenian characters, but the following characters > do not have a Unicode code point assigned. (image1.JPG attached) > They are pronounced "Yi", "Ini" and "Eh" and used in several combinations. > (Image2.JPG attached) > > Is there any reason these characters are not supported by Unicode? > I would appreciate any related information. > > Thank you! > > Kazunari Tsuboi >
Question about Karabakh Characters
Hi there, The Karabakh language uses Armenian characters, but the following characters do not have a Unicode code point assigned. (image1.JPG attached) They are pronounced "Yi", "Ini" and "Eh" and used in several combinations. (Image2.JPG attached) Is there any reason these characters are not supported by Unicode? I would appreciate any related information. Thank you! Kazunari Tsuboi
Re: XCCS (was: Historical question about 'universal signs')
See pp. 57-63 of this: Xerox. (1985). *Xerox System Network Architecture: General Information Manual* (No. XNSG 068504). Retrieved from http://archive.org/details/bitsavers_xeroxxnsXNNetworkArchitectureGeneralInformationMan_10024221 SE On Sun, Oct 23, 2016 at 10:01 AM, Doug Ewell wrote: > seth erickson wrote: > > XCCS is fairly well documented >> > > That hasn't been my experience. I'd be interested in any links you can > forward that go beyond "Unicode built on" or "drew ideas from" or "was > influenced by" XCCS. > > Thanks, > > -- > Doug Ewell | Thornton, CO, US | ewellic.org >
XCCS (was: Historical question about 'universal signs')
seth erickson wrote: XCCS is fairly well documented That hasn't been my experience. I'd be interested in any links you can forward that go beyond "Unicode built on" or "drew ideas from" or "was influenced by" XCCS. Thanks, -- Doug Ewell | Thornton, CO, US | ewellic.org
Historical question about 'universal signs'
Greetings Unicoders, I'm trying to find information (for research purposes) about a character set mentioned in Joseph Becker's 1988 draft proposal [1]: "In 1978, the initial proposal for a set of 'Universal Signs' was made by Bob Belleville at Xerox PARC. Many persons contributed ideas to the development of a new encoding design. Beginning in 1980, these efforts evolved into the Xerox Character Code Standard (XCCS) [...]" XCCS is fairly well documented but I'm having trouble finding anything about the proposal by Bob Belleville. Any pointers would be appreciated. Thanks, Seth Erickson PhD student Department of Information Studies University of California, Los Angeles [1] http://unicode.org/history/unicode88.pdf
Re: Question about Perl5 extended UTF-8 design
On 11/06/2015 01:32 PM, Richard Wordingham wrote: On Thu, 05 Nov 2015 13:41:42 -0700 "Doug Ewell" wrote: Richard Wordingham wrote: No-one's claiming it is for a Unicode Transformation Format (UTF). Then they ought not to call it "UTF-8" or "extended" or "modified" UTF-8, or anything of the sort, even if the bit-shifting algorithm is based on UTF-8. "UTF-8 encoding form" is defined as a mapping of Unicode scalar values -- not arbitrary integers -- onto byte sequences. [D92] If it extends the mapping of Unicode scalar values *into* byte sequences, then it's an extension. A non-trivial extension of a mapping of scalar values has to have a larger domain. I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks. Richard. I have no idea how my original message ended up being marked to send to this list. I'm sorry. It was meant to be a personal message for someone who I believe was involved in the original design.
Re: Question about Perl5 extended UTF-8 design
Am 05.11.2015 um 23:11 schrieb Ilya Zakharevich: First of all, “reserved” means that they have no meaning. Right? Almost. “Reserved” means that they have currently no meaning but may be assigned a meaning later; hence you ought not use them, lest your programs, or data, be invalidated by later amendments of the pertinent specification. In contrast, “invalid”, or “ill-formed” (Unicode term), means that the particular bit pattern may never be used in a sequence that purports to represent Unicode characters. In practice, that means that no program is allowed to send those ill-formed patterns in Unicode-based data exchange, and every program should refuse to accept those ill-formed patterns in Unicode-based data exchange. What a program does internally is at the discretion (or should I say: “whim”?) of its author, of course – as long as the overall effect of the program complies with the standard. Best wishes, Otto Stolz
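Otto's distinction between interchange and internal use can be illustrated in Python, whose strict UTF-8 decoder refuses ill-formed input while an internal-only error handler may carry the raw bytes through (a sketch, separate from the thread's Perl context):

```python
data = b"ok \xff"  # 0xFF can never appear in well-formed UTF-8

# Interchange: a conforming consumer refuses the ill-formed pattern.
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("refused:", e.reason)  # refused: invalid start byte

# Internal use: a program may smuggle the raw bytes through, as long
# as they never escape as purported Unicode text.
s = data.decode("utf-8", errors="surrogateescape")
assert s.encode("utf-8", errors="surrogateescape") == data
```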
Re: Question about Perl5 extended UTF-8 design
On Thu, 05 Nov 2015 13:41:42 -0700 "Doug Ewell" wrote: > Richard Wordingham wrote: > > > No-one's claiming it is for a Unicode Transformation Format (UTF). > > Then they ought not to call it "UTF-8" or "extended" or "modified" > UTF-8, or anything of the sort, even if the bit-shifting algorithm is > based on UTF-8. > "UTF-8 encoding form" is defined as a mapping of Unicode scalar values > -- not arbitrary integers -- onto byte sequences. [D92] If it extends the mapping of Unicode scalar values *into* byte sequences, then it's an extension. A non-trivial extension of a mapping of scalar values has to have a larger domain. I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks. Richard.
Re: Question about Perl5 extended UTF-8 design
It won't represent any valid Unicode code point (no standard scalar value defined), so if you use those leading bytes, don't pretend it is "UTF-8" (not even "modified UTF-8", which is the variant created in Java for its internal serialization of unrestricted 16-bit strings, including lone surrogates, and modified also in its representation of U+0000 as <0xC0,0x80> instead of <0x00> in standard UTF-8). You'll have to create your own charset identifier (e.g. "perl5-UTF-8-extended" or some name derived from your Perl5 library) and say it is not for use in interchange of standard text. The extra code points you get are then necessarily for private use (but still not part of the standard PUA set), and have absolutely no defined properties in the standard. They should not be used to represent any Unicode character or character sequence. In any API taking text input, those code points will never be decoded and will behave on input like encoding errors. But these extra code points could be used to represent something else, such as unique object identifiers for internal use in your application: virtual object pointers, shared memory block handles, file/pipe/stream I/O handles, service/API handles, user ids, security tokens, 64-bit content hashes plus some binary flags, placeholders/references for members in an external unencoded collection or for URIs, internal glyph ids when converting text for rendering with one or more fonts, or some internal serialization of geometric shapes/colors/styles/visual effects. In standard UTF-8 those extra byte values are not "reserved" but permanently assigned to be "invalid", and there are no valid encoded sequences as long as 12 or 13 bytes (0xFF was reserved only in the old RFC version of UTF-8, when it allowed code points up to 31 bits, but even that RFC is obsolete, should no longer be used, and was never approved by Unicode). 
2015-11-05 16:57 GMT+01:00 Karl Williamson: > Hi, > > Several of us are wondering about the reason for reserving bits for the > extended UTF-8 in perl5. I'm asking you because you are the apparent > author of the commits that did this. > > To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the > length of the sequence of bytes that comprise a single character to be 13 > bytes. This allows code points up to 2**72 - 1 to be represented. If the > length had been instead 12 bytes, code points up to 2**66 - 1 could be > represented, which is enough to represent any code point possible in a > 64-bit word. > > The comments indicate that these extra bits are "reserved". So we're > wondering what potential use you had thought of for these bits. > > Thanks > > Karl Williamson >
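The claim that certain lead bytes are permanently invalid in standard UTF-8 is easy to check against any strict decoder; a minimal Python check:

```python
# The highest scalar value, U+10FFFF, needs a 4-byte sequence with
# lead byte 0xF4; lead bytes 0xF5-0xFF never appear in valid UTF-8.
assert chr(0x10FFFF).encode("utf-8") == b"\xf4\x8f\xbf\xbf"

for lead in (0xF5, 0xFE, 0xFF):
    try:
        bytes([lead, 0x80, 0x80, 0x80]).decode("utf-8")
        print(hex(lead), "accepted")  # never reached
    except UnicodeDecodeError:
        print(hex(lead), "rejected")
```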
Question about Perl5 extended UTF-8 design
Hi, Several of us are wondering about the reason for reserving bits for the extended UTF-8 in perl5. I'm asking you because you are the apparent author of the commits that did this. To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the length of the sequence of bytes that comprise a single character to be 13 bytes. This allows code points up to 2**72 - 1 to be represented. If the length had been instead 12 bytes, code points up to 2**66 - 1 could be represented, which is enough to represent any code point possible in a 64-bit word. The comments indicate that these extra bits are "reserved". So we're wondering what potential use you had thought of for these bits. Thanks Karl Williamson
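The arithmetic in the question follows from the UTF-8-style scheme: a 0xFF start byte contributes no payload bits, and each continuation byte carries 6. A quick check of the figures (a sketch of the arithmetic only, not Perl's actual implementation):

```python
def max_code_point(total_bytes):
    # The 0xFF start byte carries no payload; each of the remaining
    # bytes is a continuation byte carrying 6 payload bits.
    return 2 ** (6 * (total_bytes - 1)) - 1

print(max_code_point(13) == 2**72 - 1)  # True: the 13-byte form
print(max_code_point(12) == 2**66 - 1)  # True: already covers 64-bit values
```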
Re: Question about Perl5 extended UTF-8 design
On Thu, Nov 5, 2015 at 9:25 AM, Philippe Verdy wrote: > (0xFF was reserved only in the old RFC version of UTF-8 when it allowed > code points up to 31 bits, but even this RFC is obsolete and should no > longer be used and it has never been approved by Unicode). > No, even in the original UTF-8 definition, "The octet values FE and FF never appear." https://tools.ietf.org/html/rfc2279 The highest lead byte was 0xFD. (For the "really original" version see http://www.unicode.org/L2/Historical/wg20-n193-fss-utf.pdf) In the current definition, "The octet values C0, C1, F5 to FF never appear." https://tools.ietf.org/html/rfc3629 = https://tools.ietf.org/html/std63 markus
Re: Question about Perl5 extended UTF-8 design
On Thu, 5 Nov 2015 18:25:05 +0100 Philippe Verdy wrote: > But these extra code points could be used to represent something else > such as unique object identifiers for internal use in your > application, or virtual object pointers, or shared memory block > handles, file/pipe/stream I/O handles, service/API handles, user ids, > security tokens, 64-bit content hashes plus some binary flags, > placeholders/references for members in an external unencoded > collection or for URIs, or internal glyph ids when converting text > for rendering with one or more fonts, or some internal serialization > of geometric shapes/colors/styles/visual effects...) No-one's claiming it is for a Unicode Transformation Format (UTF). A possibly relevant example of something else is a non-precomposed grapheme cluster, as in Perl6's NFG. (This isn't a PUA encoding, as the precomposed characters are created on the fly.) Richard.
Re: Question about Perl5 extended UTF-8 design
Richard Wordingham wrote: > No-one's claiming it is for a Unicode Transformation Format (UTF). Then they ought not to call it "UTF-8" or "extended" or "modified" UTF-8, or anything of the sort, even if the bit-shifting algorithm is based on UTF-8. "UTF-8 encoding form" is defined as a mapping of Unicode scalar values -- not arbitrary integers -- onto byte sequences. [D92] -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Question about Perl5 extended UTF-8 design
On Thu, Nov 05, 2015 at 08:57:16AM -0700, Karl Williamson wrote: > Several of us are wondering about the reason for reserving bits for > the extended UTF-8 in perl5. I'm asking you because you are the > apparent author of the commits that did this. To start, the INTERNAL REPRESENTATION of Perl’s strings is the «utf8» format (not «UTF-8», «extended» or not). [I see that this misprint caused a lot of stir here!] However, outside of a few contexts, this internal representation should not be visible. (However, some of these contexts are close to the default, like read/write in Unicode mode, with -C switch.) Perl’s string is just a sequence of Perl’s unsigned integers. [Depending on the build, this may be, currently, 32-bit or 64-bit.] By convention, the “meaning” of small integers coincides with what Unicode says. > To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes > the length of the sequence of bytes that comprise a single character > to be 13 bytes. This allows code points up to 2**72 - 1 to be > represented. If the length had been instead 12 bytes, code points up > to 2**66 - 1 could be represented, which is enough to represent any > code point possible in a 64-bit word. > > The comments indicate that these extra bits are "reserved". So > we're wondering what potential use you had thought of for these > bits. First of all, “reserved” means that they have no meaning. Right? Second, there are 2 ways in which one may need this INTERNAL format to be extended: • 128-bit architectures may be at hand (sooner or later). • One may need to allow “objects” to be embedded into Perl strings. With embedded objects, one must know how to kill them when the string (or its part) is removed. So, while a pointer can fit into a Perl integer, one needs to specify what to do: call DESTROY, or free(), or a user-defined function. This gives 5 possibilities (3 extra bits) which may be needed with “slots” in Perl strings. 
• Integer (≤64 bits)
• Integer (≥65 bits)
• Pointer to a Perl object
• Pointer to malloc()ed memory
• Pointer to a struct which knows how to destroy itself:

struct self_destroy {
    void *content;
    void (*destroy)(struct self_destroy *);
};

Why one may need objects embedded into strings? I explained it in http://ilyaz.org/interview (look for «Emacs» near the middle). Hope this helps, Ilya
Re: Question about Perl5 extended UTF-8 design
2015-11-05 23:11 GMT+01:00 Ilya Zakharevich wrote: > > • 128-bit architectures may be at hand (sooner or later). This is speculation about something that is still not envisioned: a global worldwide working space where users and applications would interoperate transparently in a giant virtualized environment. However, this virtualized environment will be supported by 64-bit OSes that will never need native support of more than 64-bit pointers. Those 128-bit entities needed for addressing will not be used to work on units of data but to address some small selection of remote entities. Software that would require parsing complete chunks of memory data larger than 64 bits would be extremely inefficient; instead this data will be internally structured/paged, and only virtually mapped to some 128-bit global reference (such as GUIDs/UUIDs), used only to select smaller chunks within the structure. In most cases those chunks will remain in a 32-bit space: even in today's 64-bit OSes, the largest pages are 20-bit wide, but typically 10-bit wide (512-byte sectors) to 12-bit wide (standard VMM and I/O page sizes, networking MTUs), or about 16-bit wide (such as the transmission window for TCP). This will not evolve significantly before a major evolution in the worldwide Internet backbones requiring more than about 1 Gigabit/s (a speed not even needed for 4K HD video, but needed only in massive computing grids, still built with a complex mesh of much slower data links). With 64-bit we already reach the physical limits of networking links, and higher speeds using large buses are only for extremely local links whose lengths are largely below a few millimeters, within chips themselves. 128-bit is possible, but not for working spaces (or document sizes); it is very unlikely that the ANSI C/C++ "size_t" type will ever be more than 64-bit (except for a few experiments, which will fail to be more efficient). 
What is more realistic is that internal buses and caches will be 128 bits or even larger (this is already true for GPU memory), only to support more parallelism or massive parallelism (typically by using vectored instructions working on sets of smaller values). And some data need 128-bit values for their numerical ranges (ALUs in CPUs/GPUs/APUs are already 128-bit, as are common floating point types) where extra precision is necessary. I doubt we'll ever see any true native 128-bit architecture in our remaining lifetimes. We are still very far from the limit of the 64-bit architecture, and it won't happen before the next century (if the current sequential binary model for computing is still used at that time; maybe computing will use predictive technologies returning only heuristic results with a very high probability of giving a good solution to the problems we'll need to solve extremely rapidly, and those solutions will then be validated using today's binary logic with 64-bit computing). Even in the case where a global 128-bit networking space would appear, users will never be exposed to all of it; most of this content will be inaccessible to them (restricted by security or privacy concerns) and simply unmanageable by them: no one on earth has any idea of what 2^64 bits of global data represents, and no one will ever need it in their whole life. That amount of data will only be partly implemented by large organisations trying to build a giant cloud and wishing to interoperate by coordinating their addressing spaces (for that we now have IPv6). So your "sooner or later" is very optimistic. 
IMHO we'll stay with 64-bit architectures for a very long time, up to the time when our sequential computing model is deprecated and the concept of native integer sizes is obsoleted and replaced by other kinds of computing "units" (notably parallel vectors, distributed computing, and heuristic computing, or maybe optical computing based on Fourier transforms on analog signals, or quantum computing, where our simple notion of "integers" or even "bits" will not even be placeable into individually located physical units; their persistence will not even be localized, and there will be redundant/fault-tolerant placements). In fact our computing limits will no longer be in terms of storage space, but in terms of access time, distance and predictability of results. The next technologies for faster computing will certainly be predictive/probabilistic rather than affirmative (as with today's Turing/Von Neumann machines). "Algorithms" for working with them will be completely different. Fuzzy logic will be everywhere, and we'll need binary logic less, except for small problems. We'll have to live with the possibility of errors, but we already have to live with them even with our binary logic (due to human bugs, hardware faults, accidents, and so on...). In most problems we don't even need to have 100% proven solutions (e.g. viewing a high-quality video, we already accept the
Re: Question about the Sentence_Break property
On 02/20/2015 04:56 PM, Philippe Verdy wrote: 2015-02-20 6:14 GMT+01:00 Richard Wordingham richard.wording...@ntlworld.com: TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8. One thing that is missing is mention of the convention that a single newline character (or CRLF pair) is a line break whereas a doubled newline character denotes a paragraph break. In that case CR or LF characters alone are not paragraph separators by themselves unless they are grouped together. Like NEL, they should just be considered line separators, and the terminology used in UAX 29 rule SB4 is effectively incorrect if what matters here is just the linebreak property. And in that case, the SB4 rule should effectively include NEL (from the C1 subset). But as SB4 is only related to sentence breaking, it would be a problem, because simple line breaks are used extremely frequently in the middle of sentences. What the sentence break algorithm should say is that there should first be a preprocessing step separating line breaks from paragraph breaks (creating custom entities, similar to collation elements but encoded internally with a code point outside the standard space), which rule SB4 would use instead of Sep | CR | LF. That custom entity should be Sep, but without the rule defining it, as there are various ways to represent paragraph breaks. But isn't SB4 contradictory to this from TUS Section 5.8? R2c In parsing, choose the safest interpretation. For example, in recommendation R2c an implementer dealing with sentence break heuristics would reason in the following way that it is safer to interpret any NLF as LS: • Suppose an NLF were interpreted as LS, when it was meant to be PS. Because most paragraphs are terminated with punctuation anyway, this would cause misidentification of sentence boundaries in only a few cases. • Suppose an NLF were interpreted as PS, when it was meant to be LS. 
In this case, line breaks would cause sentence breaks, which would result in significant problems with the sentence break heuristics. It seems to me SB4 is choosing the non-safer way. What am I missing? ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
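The preprocessing step suggested above, resolving newline functions before sentence breaking, can be sketched along the lines of recommendation R2c's safer reading: a lone newline is treated as a mere line break, a blank line as a paragraph separator (an illustrative heuristic only, not the UAX #29 algorithm):

```python
import re

PS = "\u2029"  # PARAGRAPH SEPARATOR

def resolve_newlines(text):
    text = re.sub(r"\r\n|\r|\x85", "\n", text)  # unify NLFs (CRLF, CR, NEL)
    text = re.sub(r"\n{2,}", PS, text)          # doubled newline -> paragraph break
    return text.replace("\n", " ")              # single newline -> in-sentence break

print(resolve_newlines("A wrapped\nsentence.\n\nNext paragraph."))
# A wrapped sentence.<PS>Next paragraph.
```

A sentence breaker running afterwards would then see a paragraph separator only where a blank line stood, instead of breaking at every wrapped line.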
Re: Question about the Sentence_Break property
On Thu, 19 Feb 2015 19:55:20 -0700 Karl Williamson pub...@khwilliamson.com wrote: UAX 29 says this: Break after paragraph separators. SB4. Sep | CR | LF Why are CR and LF considered to be paragraph separators? NEL and Line Break are as well. My mental model of plain text has it containing embedded characters, which I'll call \n, to allow it to be displayed in a terminal window of a given width. Not all text is like that, of course, but there is an awful lot that is. This rule makes no sense to me. There are two types of plain text - that which requires explicit line-breaking, and that which does not. This is a case where a non-linguistic tailoring is required. TUS has a whole section on the issue, namely TUS 7.0.0 Section 5.8. One thing that is missing is mention of the convention that a single newline character (or CRLF pair) is a line break whereas a doubled newline character denotes a paragraph break. Richard.
Question about the Sentence_Break property
UAX 29 says this: Break after paragraph separators. SB4. Sep | CR | LF Why are CR and LF considered to be paragraph separators? NEL and Line Break are as well. My mental model of plain text has it containing embedded characters, which I'll call \n, to allow it to be displayed in a terminal window of a given width. Not all text is like that, of course, but there is an awful lot that is. This rule makes no sense to me.
Re: Question about “Uppercase” in DerivedCoreProperties.txt
Philippe Verdy verd...@wanadoo.fr wrote: |glibc is not more broken than any other C library implementing toupper and |tolower from the legacy ctype standard library. These are old APIs that |are just widely used and still have valid contexts where they are simple and |safe to use. But they are not meant to convert text. Hah! Legacy is good.. I'd wish a usable successor were already standardized by ISO C. --steffen
Re: Question about Uppercase in DerivedCoreProperties.txt
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: glibc is not more broken than any other C library implementing toupper and tolower from the legacy ctype standard library. These are old APIs that are just widely used and still have valid contexts where they are simple and safe to use. But they are not meant to convert text. Well, of course they are *meant* to convert text. They're just not very good at it. -- Doug Ewell | Thornton, CO, USA | http://ewellic.org
Re: Question about “Uppercase” in DerivedCoreProperties.txt
Successors that convert strings instead of just isolated characters (sorry, isolated characters are NOT what we need to handle text; they are not even equivalent to Unicode characters, they are just code units, most often 8-bit with char or only 16-bit with wchar_t!) already exist in all C libraries (including glibc), unfortunately under different names (this is the main reason why there are complex header files trying to find the appropriate name, providing a default basic implementation that just scans individual characters to filter them with tolower and toupper: this is bad practice). Good libraries should all contain a safe implementation of case conversion of strings, and software should use it (and not reinvent this old bad trick just because it works with basic English). 2014-11-10 13:41 GMT+01:00 Steffen Nurpmeso sdao...@yandex.com: Philippe Verdy verd...@wanadoo.fr wrote: |glibc is not more broken than any other C library implementing toupper and |tolower from the legacy ctype standard library. These are old APIs that |are just widely used and still have valid contexts where they are simple and |safe to use. But they are not meant to convert text. Hah! Legacy is good.. I'd wish a usable successor were already standardized by ISO C. --steffen
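The inadequacy of per-character conversion is visible with a single character; for example in Python, whose str.upper() applies full case mapping:

```python
# "ß" (U+00DF) uppercases to the two-letter "SS"; no API that maps one
# code unit to one code unit (like C's toupper) can produce this.
word = "straße"
print(word.upper())                # STRASSE
print(len("ß"), len("ß".upper()))  # 1 2
```

This is exactly why string-level case conversion cannot be built by looping a character-level function over the input.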
Re: Question about “Uppercase” in DerivedCoreProperties.txt
The equivalent of strtolower() and strtoupper() is implemented in all C libraries I know of and have worked with on various OSes (yes, including glibc, and for a very long time!), even if their names vary (because of the unfortunate lack of standardization of their interaction with C locales). The standardization of these two functions should have happened long ago, even if the locale support could be limited to the legacy basic C locale with limited functionality, where these functions would just scan characters through strings to convert them with toupper() and tolower(). But then glibc and other libraries would have implemented this standard. For now, we still need complex config scripts to detect the correct headers to include, or to provide a basic implementation via various macros. The standard C++ string package could then have used this standard internally in the methods exposed in its API. I cannot understand why this simple effort was never made on such basic functionality, needed and used in almost all software and OSes. 2014-11-10 19:55 GMT+01:00 Steffen Nurpmeso sdao...@yandex.com: Philippe Verdy verd...@wanadoo.fr wrote: |Successors to convert strings instead of just isolated characters (sorry, |they are NOT what we need to handle texts, they are not even equivalent |to Unicode characters, they are just code units, most often 8-bit with |char or 16-bit only with wchar_t !) already exist in all C libraries |(including glibc), under different names unfortunately (this is the main |cause why there are complex header files trying to find the appropriate |name, and providing a default basic implementation that just scans |individual characters to filter them with tolower and toupper: this is a |bad practice, glibc is the _only_ standard C library I know of that supports its own homebrew functionality regarding the issue (and in a way that I personally don't want to and will never work with). 
Even the newest ISO C doesn't lend a hand here, so no ISO C programmer can expect to use any standard facility before 2020, if that is the timeline; then operating systems have to adhere to that standard, and then programmers have to be convinced to use those functions. Until then, different solutions will have to be used. --steffen ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
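The length problem underlying this exchange is easy to demonstrate in a language that does ship full-string case mappings. A minimal sketch in Python, used here only as a stand-in because, as the posts above complain, ISO C offers no standard equivalent:

```python
# Uppercasing the German sharp s is a 1-to-2 mapping: the result is
# longer than the input, which a character-to-character API in the
# style of C's toupper() can never produce.
s = "straße"
assert s.upper() == "STRASSE"
assert len(s.upper()) == len(s) + 1
```

This is exactly why a standardized strtoupper() would have to operate on whole strings (with the output possibly differing in length) rather than delegating to a per-character toupper() loop.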
Re: Question about “Uppercase” in DerivedCoreProperties.txt
Philippe Verdy verd...@wanadoo.fr wrote: |Successors to convert strings instead of just isolated characters (sorry, |they are NOT what we need to handle texts, they are not even equivalent |to Unicode characters, they are just code units, most often 8-bit with |char or 16-bit only with wchar_t!) already exist in all C libraries |(including glibc), under different names unfortunately (this is the main |cause why there are complex header files trying to find the appropriate |name, and providing a default basic implementation that just scans |individual characters to filter them with tolower and toupper: this is a |bad practice) glibc is the _only_ standard C library I know of that supports its own homebrew functionality regarding the issue (and in a way that I personally don't want to and will never work with). Even the newest ISO C doesn't lend a hand here, so no ISO C programmer can expect to use any standard facility before 2020, if that is the timeline; then operating systems have to adhere to that standard, and then programmers have to be convinced to use those functions. Until then, different solutions will have to be used. --steffen
Re: Question about “Uppercase” in DerivedCoreProperties.txt
Philippe Verdy verd...@wanadoo.fr wrote: |The standard C++ string package could have then used this standard |internally in the methods exposed in its API. I cannot understand this |simple effort was never done on such basic functionality needed and used in |almost all softwares and OSes. There are plenty of other things one can bang his head on as necessary, _that_ is for sure. Even overwhelmingly, the pessimistic may say. --steffen ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Question about “Uppercase” in DerivedCoreProperties.txt
Philippe Verdy verd...@wanadoo.fr さんはかきました: note that tolower() and toupper() can only work at the 1-character level; they are not recommended for changing the case of plain text. For correct handling of locales, tolower and toupper should be replaced by strtolower and strtoupper (or their aliases), which will be able to process character clusters and the contextual casing rules needed for a language or orthographic style Yes, thank you for explaining this. But these details of upper and lower casing cannot be expressed in the “i18n” file of glibc: https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/i18n For toupper and tolower, this file just has character-to-character mapping tables; for example, the “tolower” table contains only (U+03A3,U+03C3) (i.e. mapping Σ U+03A3 → σ U+03C3, never to the final sigma ς U+03C2). More correct, detailed information about upper and lower case must come from elsewhere, not from this “i18n” file in glibc. Using only the information from this “i18n” file, not even the Greek sigma can be handled correctly. Pravin and I want to update this “i18n” file to the latest data from Unicode 7.0.0, doing it as correctly as possible within the limitations caused by this file and the ISO C standard. -- Mike FABIAN mfab...@redhat.com ☏ Office: +49-69-365051027, internal 8875027 睡眠不足はいい仕事の敵だ。
Re: Question about “Uppercase” in DerivedCoreProperties.txt
Do not try to get consistent results with only a character-to-character mapping; it does not work for all letters, because sometimes you need 1-to-2 or 2-to-1 mappings (not all composable characters exist in precombined forms, or sometimes the combination must be split into its canonical decomposed equivalent prior to mapping the base character) or other mappings. toupper() and tolower() should not be used for anything other than mapping number-like sequences (e.g. to convert hexadecimal numbers). Use strupper() and strlower() (or equivalent functions that do not allocate memory but write to a given buffer or stream, and similar functions in languages other than C/C++) to perform mappings on full strings so that the string length can safely change. - This is needed, for example, to convert city names or people's names to capitals in a postal address, or to style a book title or chapter heading. - It is needed as well to perform case-insensitive searches (using case folding, which is different from converting to lowercase or to uppercase) to match input, or to implement an input-completion UI that locates possible matches within a known dictionary or input history. 2014-11-08 10:22 GMT+01:00 Mike FABIAN mfab...@redhat.com: Philippe Verdy verd...@wanadoo.fr さんはかきました: note that tolower() and toupper() can only work at the 1-character level; they are not recommended for changing the case of plain text. For correct handling of locales, tolower and toupper should be replaced by strtolower and strtoupper (or their aliases), which will be able to process character clusters and the contextual casing rules needed for a language or orthographic style Yes, thank you for explaining this. But these details of upper and lower casing cannot be expressed in the “i18n” file of glibc: https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/i18n For toupper and tolower, this file just has character-to-character mapping tables; for example, the “tolower” table contains only (U+03A3,U+03C3) (i.e. mapping Σ U+03A3 → σ U+03C3, never to the final sigma ς U+03C2). More correct, detailed information about upper and lower case must come from elsewhere, not from this “i18n” file in glibc. Using only the information from this “i18n” file, not even the Greek sigma can be handled correctly. Pravin and I want to update this “i18n” file to the latest data from Unicode 7.0.0, doing it as correctly as possible within the limitations caused by this file and the ISO C standard. -- Mike FABIAN mfab...@redhat.com ☏ Office: +49-69-365051027, internal 8875027 睡眠不足はいい仕事の敵だ。
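The sigma example above can be checked directly in any library that implements the full Unicode case mappings. A small Python sketch (again a stand-in, not glibc), using a made-up all-sigma word to show why a single (U+03A3, U+03C3) table entry is not enough:

```python
# ΟΣΟΣ (a made-up word): the word-final capital sigma must lowercase
# to final sigma ς (U+03C2), a contextual rule that no fixed 1:1
# mapping table can express.
word = "\u039f\u03a3\u039f\u03a3"                    # ΟΣΟΣ
assert word.lower() == "\u03bf\u03c3\u03bf\u03c2"    # οσος with final ς

# A per-character mapping (the tolower / "i18n" table model) maps
# both sigmas to the medial form σ (U+03C3):
assert "".join(c.lower() for c in word) == "\u03bf\u03c3\u03bf\u03c3"
```

The difference between the two assertions is exactly the information that cannot live in a character-to-character table.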
Re: Question about “Uppercase” in DerivedCoreProperties.txt
So glibc is broken. This doesn't make it a Unicode problem. On Sat, Nov 8, 2014 at 8:22 PM, Mike FABIAN mfab...@redhat.com wrote: Philippe Verdy verd...@wanadoo.fr さんはかきました: note that tolower() and toupper() can only work at the 1-character level; they are not recommended for changing the case of plain text. For correct handling of locales, tolower and toupper should be replaced by strtolower and strtoupper (or their aliases), which will be able to process character clusters and the contextual casing rules needed for a language or orthographic style Yes, thank you for explaining this. But these details of upper and lower casing cannot be expressed in the “i18n” file of glibc: https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/i18n For toupper and tolower, this file just has character-to-character mapping tables; for example, the “tolower” table contains only (U+03A3,U+03C3) (i.e. mapping Σ U+03A3 → σ U+03C3, never to the final sigma ς U+03C2). More correct, detailed information about upper and lower case must come from elsewhere, not from this “i18n” file in glibc. Using only the information from this “i18n” file, not even the Greek sigma can be handled correctly. Pravin and I want to update this “i18n” file to the latest data from Unicode 7.0.0, doing it as correctly as possible within the limitations caused by this file and the ISO C standard. -- Mike FABIAN mfab...@redhat.com ☏ Office: +49-69-365051027, internal 8875027 睡眠不足はいい仕事の敵だ。 -- Christopher Vance
Re: Question about “Uppercase” in DerivedCoreProperties.txt
glibc is no more broken than any other C library implementing toupper and tolower from the legacy ctype standard library. These are old APIs that are just widely used and still have valid contexts where they are simple and safe to use. But they are not meant to convert text. The i18n data just shows the mappings used for tolower, toupper (and totitle), but it is clearly not enough to implement strtolower and strtoupper, which require more rules (notably 1-to-2 or 2-to-1 mappings, plus support for normalisation/composition/decomposition and recognizing canonical equivalents, in all possible reorderings, and more data for contextual rules such as the final form of sigma). Such data may not be easily expressible in such a tabular format in some cases, and could instead be implemented by locale-specific code, for example to handle some dictionary lookups (as required with some Asian scripts for word breaking, and implicitly needed for the Korean script, whose normalisation is not handled by table lookups but algorithmically, by code only, within the normalizer). I don't see anything wrong with the existing glibc i18n data. glibc would be wrong, however, if it *only* used tolower/toupper to implement strtolower/strtoupper (but this was what was still done in the past since the creation of the standard C library on Unix, and even later on DOS, MacOS, Windows and most other systems... before the creation of Unicode and its development to support more languages, scripts, and orthographic systems.) 
Modern i18n libraries (for various programming languages) contain more advanced API support for correct case mappings on full strings (including M-to-N mappings, contextual rules and support of canonical equivalences). These APIs no longer assume that the output string will be the same length as the input, or that only 1:1 mappings will be performed over each character (even if this is still what is done when using the C root locale, which works only for a few languages and only with simple texts using restricted alphabets without all the possible Unicode extensions, needed now to support not only the native language but also many proper names and foreign toponyms, or texts containing short citations in another language, or any multilingual document). 2014-11-09 1:45 GMT+01:00 Christopher Vance cjsva...@gmail.com: So glibc is broken. This doesn't make it a Unicode problem.
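The distinction drawn earlier in the thread between case folding (for matching) and lowercasing (for display) can be illustrated with Python's str.casefold(), again used only as a stand-in for the full-string APIs the thread wishes C had:

```python
# Case folding is meant for case-insensitive matching and is not the
# same operation as lowercasing: folding expands 'ß' to "ss", while
# lowercasing leaves it alone.
assert "Maße".lower() == "maße"
assert "Maße".casefold() == "masse"
assert "Maße".casefold() == "MASSE".casefold()   # now they compare equal
```

Note that comparing lower() results would wrongly report "Maße" and "MASSE" as different; that is why folding is a separate operation in the Unicode model.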
Re: Question about “Uppercase” in DerivedCoreProperties.txt
note that tolower() and toupper() can only work at the 1-character level; they are not recommended for changing the case of plain text. Their purpose should be limited to use cases where letters can be safely isolated from their context, for example when handling letters as numbers (e.g. section numbering). For correct handling of locales, tolower and toupper should be replaced by strtolower and strtoupper (or their aliases), which will be able to process character clusters and the contextual casing rules needed for a language or orthographic style (such as monotonic and polytonic Greek, or for specific locales intended for medieval texts or old classic scriptures). strupper and strlower can then perform mappings that tolower and toupper cannot perform using only simple mappings. So precombined Greek letters with iota subscripts can only be converted by preserving the iota subscript (for which islower() and isupper() are BOTH false when it is encoded separately and not precombined). When a Greek letter precombined with an iota subscript is found, the letter case of this iota subscript should be ignored, and only the letter case of the base letter will be considered; this means that it will only be possible for tolower() and toupper() to map one orthographic style: the style that preserves the subscript, but not the classic Greek or modern monotonic style that doesn't know anything about this medieval extension of the Greek alphabet, which was still in use at the beginning of the 1970's (handling polytonic Greek with tolower() and toupper(), or with islower() and isupper(), will not produce the correct result). 
For modern Greek, there's no use of this iota subscript, so we are in the same situation as classic Greek (before the Christian era), except that modern Greek still uses a few accents (notably the tonos, equivalent in Unicode to the acute accent, even if its placement over Greek capitals is preferably before the letter rather than above it, as could be suggested by its assigned combining class). 2014-11-07 12:32 GMT+01:00 Mike FABIAN mfab...@redhat.com: Philippe Verdy verd...@wanadoo.fr さんはかきました: this is a feature of the Greek alphabet that the lowercase iota subscript can be capitalized in two different ways: either as a subscript below the uppercase main letter, or as a standard iota capitalized. The subscript form is a combining character, but not the non-subscript form. Laurentiu All of the characters you enumerated are titlecase letters Laurentiu (gc=Lt) rather than uppercase letters (gc=Lu), U+1F80 ᾀ is something like ἀι and could be capitalized as ἈΙ or as ᾈ. ᾈ is something like Ἀι so I understand now that ᾈ can be considered as titlecase (gc=Lt). Note that for modern Greek there's still a difficulty about the special final form of lowercase sigma: it is effectively lowercase (islower should return true), not titlecase, and toupper will map it to a standard capital Sigma. But the reverse conversion will only be able to convert the uppercase sigma to a standard lowercase sigma, ignoring the final form. 
To handle the final form correctly, don't use tolower() character per character, but use strtolower() with a decent library that supports contextual rules (the same is true for the German ess-tsett, which was capitalized as a double S but not reversibly, even if recently an uppercase variant of ess-tsett was added to Unicode, though it is still extremely rarely used: it is extremely difficult to determine how to convert a double capital S, and most libraries will only convert it to a double lowercase s, and some locales deliberately decide not to alter the lowercase ess-tsett with toupper or strtoupper; this is still correct if those libraries have not been updated to use the capital ess-tsett now supported in more recent versions of Unicode, but not found in any other legacy encodings). We still have a difficulty with the ampersand, because it has been encoded only as a symbol, assuming that for most locales it is just used in isolation as an abbreviated form of a word. But in some locales it was still considered a letter and used everywhere "et" could be used, including in abbreviations like &c. == etc., or in the middle of words like car&t == caret or comm&tre == commettre. But the modern use of the ampersand implies there's a word break before and after the symbol, and we should have a separate encoding for & as a lowercase ligature, and we should even have an uppercase variant like the German ess-tsett, as there are glyphic variants of the ligature for uppercased titles where the modern ampersand does not fit very well, or where it should be mapped to a non-ligatured ET letter pair, distinct from the mapping (with spaces around) to ET in French or to AND in English, as implied by the modern meaning of the current symbol as a separate word by itself. With a distinct encoding of the ligature, the common abbreviation etc. ligatured as &c. would correctly map to uppercase &C. with
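The two capitalization choices for the iota subscript discussed above correspond to the full mapping in SpecialCasing.txt versus the simple (1:1) mapping in UnicodeData.txt; a quick Python check of U+1F80 (a sketch relying on the UCD data bundled with CPython):

```python
import unicodedata

# Full uppercasing of U+1F80 is a 1-to-2 mapping: the ypogegrammeni
# becomes a full capital iota, per SpecialCasing.txt.
assert "\u1f80".upper() == "\u1f08\u0399"   # ᾀ -> ἈΙ

# The 1:1 alternative is the titlecase letter U+1F88 ᾈ, which keeps
# the subscript attached to the capital alpha.
assert unicodedata.name("\u1f88") == (
    "GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI")
```

This is why no toupper()-style table can serve both orthographic styles at once: one style needs the 1-to-2 expansion, the other the precombined titlecase letter.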
Re: Question about “Uppercase” in DerivedCoreProperties.txt
Philippe Verdy verd...@wanadoo.fr さんはかきました: this is a feature of the Greek alphabet that the lowercase iota subscript can be capitalized in two different ways : either as a subscript below the uppercase main letter, or as a standard iota capitalized. The subscript form is a combining character, but not the non-subscript form. Now I understand why these are titlecase letters, as Laurentiu explained: Laurentiu All of the characters you enumerated are titlecase letters Laurentiu (gc=Lt) rather than uppercase letters (gc=Lu), U+1F80 ᾀ is something like ἀι and could be capitalized as ἈΙ or as ᾈ. ᾈ is something like Ἀι so I understand now that ᾈ can be considered as titlecase (gc=Lt). Thank you very much, Phillipe and Laurentiu for explaining! I stumbled on this question because I am trying to update the character class data for glibc for Unicode 7.0.0. glibc has character classes “upper” and “lower” but not “title”. Bruno Haible’s program to generate the character class data from UnicodeData.txt tries to enforce that every character which has a “toupper” mapping *must* be in either “upper” or “lower”. https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/gen-unicode-ctype.c;h=0c001b299d4601a375a1e814fd2ab06b0536b337;hb=HEAD#l660 I think Bruno’s program does this because ISO C 99 (ISO/IEC 9899 - Programming languages - C) http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf contains: 7.4.2.2 The toupper function [...] If the argument is a character for which islower is true and there are one or more corresponding characters, as specified by the current locale, for which isupper is true, the toupper function returns one of the corresponding characters (always the same one for any given locale); otherwise, the argument is returned unchanged. which seems to require that toupper should only do something for characters where islower is true. Therefore, Bruno’s program puts title case characters like U+1F88 ᾈ or U+01C5 Dž into *both*, “upper” and “lower”. 
Which does not look unreasonable, given the limitations of C99. So because of this limitation we have to continue using this approach, as ISO C 99 requires it; we cannot use the “Uppercase” property from DerivedCoreProperties.txt for this. But the “Alphabetic” property from DerivedCoreProperties.txt can probably be used to generate the “alpha” character class for glibc. I hope this is correct. -- Mike FABIAN mfab...@redhat.com ☏ Office: +49-69-365051027, internal 8875027 睡眠不足はいい仕事の敵だ。
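glibc aside, the classification at issue can be observed in any library that follows the UCD; a short Python sketch of why U+1F88 falls cleanly into neither the “upper” nor the “lower” class:

```python
import unicodedata

# U+1F88 is a titlecase letter (gc=Lt): neither uppercase nor
# lowercase, yet it still has a lowercase mapping. This is exactly
# the combination that trips up the C99 islower/isupper/toupper model.
ch = "\u1f88"   # ᾈ
assert unicodedata.category(ch) == "Lt"
assert not ch.isupper() and not ch.islower()
assert ch.istitle()
assert ch.lower() == "\u1f80"   # ᾀ
```

Since glibc has no “title” class, putting such characters into both “upper” and “lower”, as Bruno Haible's generator does, is the least-bad way to satisfy the C99 wording quoted above.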
Question about Uppercase in DerivedCoreProperties.txt
I have a question about “Uppercase” in DerivedCoreProperties.txt: U+1F80 ᾀ GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI is listed as “Lowercase” in http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt : 1F80..1F87; Lowercase # L& [8] GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI..GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI But “U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI” is *not* listed as “Uppercase” in http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt . Although U+1F88 seems to be Uppercase according to http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt because it has a tolower mapping to U+1F80: 1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88 1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80; Is the information in DerivedCoreProperties.txt correct or could this be a bug in DerivedCoreProperties.txt? The above is not only the case for U+1F88, but for several more characters. 
All the characters listed below have a tolower mapping in http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt but are not listed in DerivedCoreProperties.txt as “Uppercase”: U+1F88 ᾈ has a tolower mapping to U+1F80 ᾀ U+1F89 ᾉ has a tolower mapping to U+1F81 ᾁ U+1F8A ᾊ has a tolower mapping to U+1F82 ᾂ U+1F8B ᾋ has a tolower mapping to U+1F83 ᾃ U+1F8C ᾌ has a tolower mapping to U+1F84 ᾄ U+1F8D ᾍ has a tolower mapping to U+1F85 ᾅ U+1F8E ᾎ has a tolower mapping to U+1F86 ᾆ U+1F8F ᾏ has a tolower mapping to U+1F87 ᾇ U+1F98 ᾘ has a tolower mapping to U+1F90 ᾐ U+1F99 ᾙ has a tolower mapping to U+1F91 ᾑ U+1F9A ᾚ has a tolower mapping to U+1F92 ᾒ U+1F9B ᾛ has a tolower mapping to U+1F93 ᾓ U+1F9C ᾜ has a tolower mapping to U+1F94 ᾔ U+1F9D ᾝ has a tolower mapping to U+1F95 ᾕ U+1F9E ᾞ has a tolower mapping to U+1F96 ᾖ U+1F9F ᾟ has a tolower mapping to U+1F97 ᾗ U+1FA8 ᾨ has a tolower mapping to U+1FA0 ᾠ U+1FA9 ᾩ has a tolower mapping to U+1FA1 ᾡ U+1FAA ᾪ has a tolower mapping to U+1FA2 ᾢ U+1FAB ᾫ has a tolower mapping to U+1FA3 ᾣ U+1FAC ᾬ has a tolower mapping to U+1FA4 ᾤ U+1FAD ᾭ has a tolower mapping to U+1FA5 ᾥ U+1FAE ᾮ has a tolower mapping to U+1FA6 ᾦ U+1FAF ᾯ has a tolower mapping to U+1FA7 ᾧ U+1FBC ᾼ has a tolower mapping to U+1FB3 ᾳ U+1FCC ῌ has a tolower mapping to U+1FC3 ῃ U+1FFC ῼ has a tolower mapping to U+1FF3 ῳ Is that correct or a bug? -- Mike FABIAN mfab...@redhat.com ☏ Office: +49-69-365051027, internal 8875027 睡眠不足はいい仕事の敵だ。 ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Question about “Uppercase” in DerivedCoreProperties.txt
this is a feature of the Greek alphabet that the lowercase iota subscript can be capitalized in two different ways: either as a subscript below the uppercase main letter, or as a standard iota capitalized. The subscript form is a combining character, but not the non-subscript form. There should exist a special contextual rule for language-specific casings; there's one already for the final sigma, but not for the iota. It is not evident to handle, and in fact the choice of case mapping is not specifically a linguistic rule but a rendering-style rule: for carved inscriptions, which generally use only capitals, the combining forms are generally avoided and a reduced alphabet is used. For handwritten and cursive styles, the extended alphabet is used, and this enables contextual forms including the small iota subscript and the final small sigma and many combining signs (this also allows other placement rules for accents). For printing or display purposes there's no rule; the document author enables or disables the extended alphabet (generally disabled for rendering at small resolutions). The simple case mappings, however, should preserve the distinctions present in the extended alphabet, but simple uppercasing of text should not convert lowercase to all uppercase with an appended uppercase iota, even if this maps a lowercase letter to a titlecase one (it would be lossy, and simple casing rules should be lossless). Case mappings in the main UCD, however, ignore the contextual rules and the language-specific and style-specific rules. But even if they are wrong, this cannot be changed. The simple mappings in the main UCD file should not be assumed to be lossless. 
Actual case mappers do not use just these basic rules, which are just the most frequent mappings assumed (anyway, any kind of case conversion introduces a loss; the degree of loss varies when mappings are not concerned with just a single pair of simple letters: see also the old difficulties about the German ess-tsett or sharp sign, and about many ligatures that became plain letters in some contexts, including the ampersand '&' sign, which originates from the "et" ligature, or the German umlaut, which inherits some old behavior of the superscripted small Latin letter e, behaving like the Greek iota subscript, in Fraktur font styles). 2014-11-06 16:55 GMT+01:00 Mike FABIAN maiku.fab...@gmail.com: I have a question about “Uppercase” in DerivedCoreProperties.txt: U+1F80 ᾀ GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI is listed as “Lowercase” in http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt : 1F80..1F87; Lowercase # L& [8] GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI..GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI But “U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI” is *not* listed as “Uppercase” in http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt . Although U+1F88 seems to be Uppercase according to http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt because it has a tolower mapping to U+1F80: 1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88 1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80; Is the information in DerivedCoreProperties.txt correct or could this be a bug in DerivedCoreProperties.txt? The above is not only the case for U+1F88, but for several more characters. 
All the characters listed below have a tolower mapping in http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt but are not listed in DerivedCoreProperties.txt as “Uppercase”: U+1F88 ᾈ has a tolower mapping to U+1F80 ᾀ U+1F89 ᾉ has a tolower mapping to U+1F81 ᾁ U+1F8A ᾊ has a tolower mapping to U+1F82 ᾂ U+1F8B ᾋ has a tolower mapping to U+1F83 ᾃ U+1F8C ᾌ has a tolower mapping to U+1F84 ᾄ U+1F8D ᾍ has a tolower mapping to U+1F85 ᾅ U+1F8E ᾎ has a tolower mapping to U+1F86 ᾆ U+1F8F ᾏ has a tolower mapping to U+1F87 ᾇ U+1F98 ᾘ has a tolower mapping to U+1F90 ᾐ U+1F99 ᾙ has a tolower mapping to U+1F91 ᾑ U+1F9A ᾚ has a tolower mapping to U+1F92 ᾒ U+1F9B ᾛ has a tolower mapping to U+1F93 ᾓ U+1F9C ᾜ has a tolower mapping to U+1F94 ᾔ U+1F9D ᾝ has a tolower mapping to U+1F95 ᾕ U+1F9E ᾞ has a tolower mapping to U+1F96 ᾖ U+1F9F ᾟ has a tolower mapping to U+1F97 ᾗ U+1FA8 ᾨ has a tolower mapping to U+1FA0 ᾠ U+1FA9 ᾩ has a tolower mapping to U+1FA1 ᾡ U+1FAA ᾪ has a tolower mapping to U+1FA2 ᾢ U+1FAB ᾫ has a tolower mapping to U+1FA3 ᾣ U+1FAC ᾬ has a tolower mapping to U+1FA4 ᾤ U+1FAD ᾭ has a tolower mapping to U+1FA5 ᾥ U+1FAE ᾮ has a tolower mapping to U+1FA6 ᾦ U+1FAF ᾯ has a tolower mapping to U+1FA7 ᾧ U+1FBC ᾼ has a tolower mapping to U+1FB3 ᾳ U+1FCC ῌ has a tolower mapping to U+1FC3 ῃ U+1FFC ῼ has a tolower mapping to U+1FF3 ῳ Is that correct or a bug
RE: Question about Uppercase in DerivedCoreProperties.txt
Hello, The property Uppercase is a binary, informative property derived from General_Category (gc=Lu) and Other_Uppercase (OUpper=Y), as documented in Section 5.3 of UAX #44, at http://www.unicode.org/reports/tr44/#Uppercase. All of the characters you enumerated are titlecase letters (gc=Lt) rather than uppercase letters (gc=Lu), and they are not specifically assigned Other_Uppercase (which would otherwise contradict their General_Category). Following the derivation, they do not have the Uppercase binary property. For a visualization of the set of characters assigned the binary property Uppercase in relation to the set of Uppercase_Letter characters (gc=Lu), you can use the UnicodeSet comparison tool at http://www.unicode.org/cldr/utility/unicodeset.jsp. Enter “[:gc=Lu:]” in one input field and “[:Uppercase:]” in the other field, then click on Compare. Regards, L. -Original Message- From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Mike FABIAN Sent: Thursday, November 6, 2014 12:32 AM To: unicode@unicode.org Subject: Question about Uppercase in DerivedCoreProperties.txt I have a question about “Uppercase” in DerivedCoreProperties.txt: U+1F80 ᾀ GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI is listed as “Lowercase” in http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt : 1F80..1F87; Lowercase # L [8] GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI..GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI But “U+1F88 ᾈ GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI” is *not* listed as “Uppercase” in http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt . 
Although U+1F88 seems to be Uppercase according to http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt because it has a tolower mapping to U+1F80: 1F80;GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI;Ll;0;L;1F00 0345;;;;N;;;1F88;;1F88 1F88;GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI;Lt;0;L;1F08 0345;;;;N;;;;1F80; Is the information in DerivedCoreProperties.txt correct or could this be a bug in DerivedCoreProperties.txt? The above is not only the case for U+1F88, but for several more characters. All the characters listed below have a tolower mapping in http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt but are not listed in DerivedCoreProperties.txt as “Uppercase”: U+1F88 ᾈ has a tolower mapping to U+1F80 ᾀ U+1F89 ᾉ has a tolower mapping to U+1F81 ᾁ U+1F8A ᾊ has a tolower mapping to U+1F82 ᾂ U+1F8B ᾋ has a tolower mapping to U+1F83 ᾃ U+1F8C ᾌ has a tolower mapping to U+1F84 ᾄ U+1F8D ᾍ has a tolower mapping to U+1F85 ᾅ U+1F8E ᾎ has a tolower mapping to U+1F86 ᾆ U+1F8F ᾏ has a tolower mapping to U+1F87 ᾇ U+1F98 ᾘ has a tolower mapping to U+1F90 ᾐ U+1F99 ᾙ has a tolower mapping to U+1F91 ᾑ U+1F9A ᾚ has a tolower mapping to U+1F92 ᾒ U+1F9B ᾛ has a tolower mapping to U+1F93 ᾓ U+1F9C ᾜ has a tolower mapping to U+1F94 ᾔ U+1F9D ᾝ has a tolower mapping to U+1F95 ᾕ U+1F9E ᾞ has a tolower mapping to U+1F96 ᾖ U+1F9F ᾟ has a tolower mapping to U+1F97 ᾗ U+1FA8 ᾨ has a tolower mapping to U+1FA0 ᾠ U+1FA9 ᾩ has a tolower mapping to U+1FA1 ᾡ U+1FAA ᾪ has a tolower mapping to U+1FA2 ᾢ U+1FAB ᾫ has a tolower mapping to U+1FA3 ᾣ U+1FAC ᾬ has a tolower mapping to U+1FA4 ᾤ U+1FAD ᾭ has a tolower mapping to U+1FA5 ᾥ U+1FAE ᾮ has a tolower mapping to U+1FA6 ᾦ U+1FAF ᾯ has a tolower mapping to U+1FA7 ᾧ U+1FBC ᾼ has a tolower mapping to U+1FB3 ᾳ U+1FCC ῌ has a tolower mapping to U+1FC3 ῃ U+1FFC ῼ has a tolower mapping to U+1FF3 ῳ Is that correct or a bug? 
--
Mike FABIAN <mfab...@redhat.com>  ☏ Office: +49-69-365051027, internal 8875027
睡眠不足はいい仕事の敵だ。

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode
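The derivation described in the reply above can be illustrated with Python's standard library. This is just an illustrative sketch (it relies on the fact that Python's case predicates follow the Unicode character properties), not the UCD derivation tooling itself:

```python
import unicodedata

# U+1F88 GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
ch = "\u1F88"

# Its General_Category is Lt (titlecase), not Lu (uppercase):
print(unicodedata.category(ch))    # Lt

# It nevertheless has a simple lowercase mapping to U+1F80:
print(ch.lower() == "\u1F80")      # True

# Per the derivation Uppercase = (gc=Lu) or (Other_Uppercase=Y),
# a titlecase letter does not get the binary Uppercase property.
# Python's str.isupper()/str.istitle() reflect the same distinction:
# the character is cased and titlecase, but not uppercase.
print(ch.isupper(), ch.istitle())  # False True
```

So the absence of these characters from the Uppercase lines of DerivedCoreProperties.txt is by design, not a bug: having a tolower mapping does not imply gc=Lu.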
Question about a Normalization test
Hi all, from the latest version of the standard, on line 16977 of the normalization tests, I am a bit confused by the NFC form. It appears incorrect to me. Here's the line, sans comment:

0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;

Just looking at column 2, which according to the comments at the top is the NFC form:

0061 05AE 0305 0300 0315 0062

This, however, does not appear to be in NFC form. The first character, and the second or third characters do not compose. However, the first and fourth (0061 and 0300) do, composing to 00E0. Since there are no further compositions, the normalized form should be

00E0 05AE 0305 0315 0062

What am I missing? Thanks in advance for your help!

Aaron
Re: Question about a Normalization test
On Thu, Oct 23, 2014 at 6:54 PM, Aaron Cannon <cann...@fireantproductions.com> wrote:

> 0061 05AE 0305 0300 0315 0062

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cu0061+%5Cu05AE+%5Cu0305+%5Cu0300+%5Cu0315+%5Cu0062&g=ccc

0305 and 0300 have the same ccc, so the first one blocks the second.

http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G49576

The older spec is shorter, although not as precise:
http://www.unicode.org/reports/tr15/tr15-29.html#Specification

Mark https://google.com/+MarkDavis
*— Il meglio è l’inimico del bene —*
RE: Question about a Normalization test
Aaron Cannon asked:

> Hi all, from the latest version of the standard, on line 16977 of the normalization tests, I am a bit confused by the NFC form. It appears incorrect to me. Here's the line, sans comment:
>
> 0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;
>
> Just looking at column 2, which according to the comments at the top is the NFC form: 0061 05AE 0305 0300 0315 0062. This, however, does not appear to be in NFC form. The first character, and the second or third characters do not compose. However, the first and fourth (0061 and 0300) do, composing to 00E0. Since there are no further compositions, the normalized form should be 00E0 05AE 0305 0315 0062. What am I missing?

Input is:

  Code points: 0061 0305 0315 0300 05AE 0062
  ccc:            0  230  232  230  228    0

Output of canonical reordering is:

  Code points: 0061 05AE 0305 0300 0315 0062
  ccc:            0  228  230  230  232    0

Next step is to start from 0061 and test each successive combining mark, looking for composition candidates.

0061 does not compose with 05AE.
0061 does not compose with 0305.
0061 *could* compose with 0300 (00E0 = 0061 + 0300), *but* 0300 is *blocked* from 0061 by the intervening combining mark 0305 with the *same* ccc value as 0300. So the composition does not occur.
0061 does not compose with 0315.

The next character is 0062, ccc=0, a starter, so we are done.

For the relevant definitions, see: http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G50628 and scroll down a couple pages to D115 on p. 139.

Test cases like this are included in NormalizationTest.txt precisely to ensure that implementations are correctly detecting these sequences where composition is blocked.

--Ken
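The blocking behavior Ken walks through can be checked directly against a conformant implementation; Python's unicodedata module implements Unicode normalization and exposes the Canonical_Combining_Class values:

```python
import unicodedata

# Column 1 (source) and column 2 (NFC) of the test line under discussion:
src = "\u0061\u0305\u0315\u0300\u05AE\u0062"
nfc = "\u0061\u05AE\u0305\u0300\u0315\u0062"

# Canonical Combining Classes of the marks involved:
for c in "\u05AE\u0305\u0300\u0315":
    print(f"U+{ord(c):04X} ccc={unicodedata.combining(c)}")
# U+05AE ccc=228, U+0305 ccc=230, U+0300 ccc=230, U+0315 ccc=232

# After canonical reordering, 0300 would compose with 0061 (to U+00E0),
# but the preceding 0305 has the same ccc (230), so it blocks the
# composition, and the NFC result keeps both marks uncombined:
print(unicodedata.normalize("NFC", src) == nfc)  # True
```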
Re: Question about a Normalization test
On 10/23/14, Whistler, Ken ken.whist...@sap.com wrote: Test cases like this are included in NormalizationTest.txt precisely to ensure that implementations are correctly detecting these sequences where composition is blocked. And I am indeed glad that they are, as I completely missed this small but critical detail. Thanks so much all! Aaron
Question about WordBreak property rules
http://www.unicode.org/draft/reports/tr29/tr29.html#WB6 indicates that there should be no break between the first two letters in the sequence Hebrew_Letter Single_Quote Hebrew_Letter. However, rule 7a just below indicates that there should be no break between a Hebrew_Letter and a Single_Quote even if what follows is not a Hebrew_Letter. This is not contradictory, but it is suspicious. It makes me wonder if there is an error in the specification. Assuming there is not, then rule 7a ought to be before current rule 6, which itself should be divided so that there isn't redundant specification of the Hebrew_Letter rules.
Re: Question about WordBreak property rules
On 07/24/2014 01:38 PM, Karl Williamson wrote:

> http://www.unicode.org/draft/reports/tr29/tr29.html#WB6 indicates that there should be no break between the first two letters in the sequence Hebrew_Letter Single_Quote Hebrew_Letter. However, rule 7a just below indicates that there should be no break between a Hebrew_Letter and a Single_Quote even if what follows is not a Hebrew_Letter. This is not contradictory, but it is suspicious. It makes me wonder if there is an error in the specification. Assuming there is not, then rule 7a ought to be before current rule 6, which itself should be divided so that there isn't redundant specification of the Hebrew_Letter rules.

In reading this after I sent it, I'm not sure I was clear enough. Rule 6 implies that you need additional context to decide whether to break between a Hebrew_Letter followed by a Single_Quote. Yet Rule 7a says that you don't need any additional context; you never break there.
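The redundancy Karl points out can be made concrete with a toy model of just these two rules. This is a sketch, not an implementation of the full UAX #29 rule chain, and the helper names are mine:

```python
def wb6_needs_lookahead(left: str, mid: str, right: str) -> bool:
    """WB6: (ALetter | Hebrew_Letter) x (MidLetter | MidNumLetQ) (ALetter | Hebrew_Letter).
    Deciding 'no break' between left and mid requires looking at the right context."""
    return (left in ("ALetter", "Hebrew_Letter")
            and mid in ("MidLetter", "Single_Quote")
            and right in ("ALetter", "Hebrew_Letter"))

def wb7a_no_break(left: str, mid: str) -> bool:
    """WB7a: Hebrew_Letter x Single_Quote. No break, regardless of what follows."""
    return left == "Hebrew_Letter" and mid == "Single_Quote"

# For Hebrew_Letter + Single_Quote, WB7a already answers 'no break'
# with no lookahead at all, so WB6's Hebrew_Letter case is subsumed:
print(wb7a_no_break("Hebrew_Letter", "Single_Quote"))                 # True
print(wb6_needs_lookahead("Hebrew_Letter", "Single_Quote", "Other"))  # False
```

This is exactly the observation in the message above: once WB7a applies, the Hebrew_Letter half of WB6 never decides anything for the Single_Quote case, whatever the right-hand context is.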
question to Akkadian
Folks, I'm trying to find an encoding of the following Akkadian cuneiform sign:

[an ASCII-art sketch of the sign appeared here; its fixed-pitch alignment has not survived this archive]

My knowledge of cuneiforms is zero, but I can read Unicode tables :-) However, I haven't found it in the Akkadian cuneiforms block. Either I've missed it, or it gets represented as a ligature, or ... In case it is a ligature: Where should I look to find well drawn glyphs? Or to formulate it more generally: If I have a cuneiform text, where can I find glyph images to identify them?

Werner
Re: question to Akkadian
On May 19, 2014, at 8:40 AM, Werner LEMBERG wrote: If I have a cuneiform text, where can I find glyph images to identify them? You might want to specify what you mean by text. A photo of an inscription? Something from a printed book? Because of the considerable variation in glyphs over the long time period when this script was used, you may need to consult a reference that tries to cover that, like Labat's Manuel d'Épigraphie Akkadienne.
Re: question to Akkadian
If I have a cuneiform text, where can I find glyph images to identify them? You might want to specify what you mean by text. A photo of an inscription? Something from a printed book? I'm interested in representing one of the so-called Hurrian songs (tablet H.6, containing musical notation) with Unicode, cf. https://en.wikipedia.org/wiki/Hurrian_songs A much better drawing of the tablet can be found here on page 503: http://digital.library.stonybrook.edu/cdm/ref/collection/amar/id/7250 The character in question is the first one on the left after the double line. A nice article on this song can be found here: http://individual.utoronto.ca/seadogdriftwood/Hurrian/Website_article_on_Hurrian_Hymn_No._6.html Werner
Re: question to Akkadian
On May 19, 2014, at 9:21 AM, Werner LEMBERG wrote: I'm interested in representing one of the so-called Hurrian songs (tablet H.6, containing musical notation) with Unicode, cf. https://en.wikipedia.org/wiki/Hurrian_songs That says it represents qáb, which seems to be a version of Labat 88, which is U+1218F KAB. Unfortunately none of my fonts give the version shown in that drawing, but there may be one. Photo from Labat attached.
Re: question to Akkadian
I'm interested in representing one of the so-called Hurrian songs (tablet H.6, containing musical notation) with Unicode, cf. https://en.wikipedia.org/wiki/Hurrian_songs That says it represents qáb, which seems to be a version of Labat 88, which is U+1218F KAB. Unfortunately none of my fonts give the version shown in that drawing, but there may be one. Thanks a lot! Will try to get the book you've mentioned... BTW, it seems to me that cuneiform would benefit enormously from variant selectors, collecting all cuneiform variants in a database similar to the CJK stuff. Werner
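The sign identified in this thread can be confirmed from the character database itself; a quick lookup with Python's unicodedata:

```python
import unicodedata

# The code point suggested above for the qáb sign:
ch = "\U0001218F"
print(unicodedata.name(ch))   # CUNEIFORM SIGN KAB
print(f"U+{ord(ch):04X}")     # U+1218F
```

Note that this only confirms the abstract character; as discussed above, the particular glyph variant on tablet H.6 depends on the font, since Unicode encodes the sign, not its epigraphic variants.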
Fwd: Terminology question re ASCII
Sorry, should have cc:d the list. Assume original mail was from a list member. -- Forwarded message -- From: Christopher Vance cjsva...@gmail.com Date: 29 October 2013 16:58 Subject: Re: Terminology question re ASCII To: Mark Davis ☕ m...@macchiato.com Of course, once you have 8-bit characters in the upper range from 0x80 up, you can only know intrinsically that it's not actually ASCII, and that anybody who says it is, is probably lying. You can only determine the actual character set used from extrinsic information. Is the 8th bit just parity? Is it a Microsoft set with those graphical things? Is it one of the Latin-N sets (which one)? EBCDIC? Something else? On 29 October 2013 16:38, Mark Davis ☕ m...@macchiato.com wrote: Normally the term ASCII just refers to the 7-bit form. What is sometimes called 8-bit ASCII is the same as ISO Latin 1. If you want to be completely clear, you can say 7-bit ASCII. Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Tue, Oct 29, 2013 at 5:12 AM, d...@bisharat.net wrote: Quick question on terminology use concerning a legacy encoding: If one refers to plain ASCII, or plain ASCII text or ... characters, should this be taken strictly as referring to the 7-bit basic characters, or might it encompass characters that might appear in an 8-bit character set (per the so-called extended ASCII)? I've always used the term ASCII in the 7-bit, 128 character sense, and modifying it with plain seems to reinforce that sense. (Although plain text in my understanding actually refers to lack of formatting.) Reason for asking is encountering a reference to plain ASCII describing text that clearly (by presence of accented characters) would be 8-bit. The context is one of many situations where in attaching a document to an email, it is advisable to include an unformatted text version of the document in the body of the email. Never mind that the latter is probably in UTF-8 anyway(?) 
- the issue here is the terminology. TIA for any feedback. Don Osborn Sent via BlackBerry by AT&T

--
Christopher Vance
Re: Terminology question re ASCII
2013-10-29 6:12, d...@bisharat.net wrote:

> If one refers to plain ASCII, or plain ASCII text or ... characters, should this be taken strictly as referring to the 7-bit basic characters, or might it encompass characters that might appear in an 8-bit character set (per the so-called extended ASCII)?

In correct usage, “ASCII” refers to a specific standard, namely “American National Standard for Information Systems - Coded Character Sets - 7-Bit American National Standard Code for Information Interchange (7-Bit ASCII)”, ANSI X3.4-1986, except in historical presentations, where it might refer to predecessors of that standard (earlier versions of ASCII). In common usage, “ASCII” is also used to denote a) text data in general, b) some 8-bit encoding that has ASCII characters as its 7-bit subset, and c) other things. This can be very confusing, and that’s why the standard has the parenthetic note “7-Bit ASCII” and why people often use “US-ASCII” as the name of the ASCII encoding. The clarifying prefixes are, however, also misleading in the sense that they suggest the existence of other ASCIIs.

> I've always used the term ASCII in the 7-bit, 128 character sense, and modifying it with plain seems to reinforce that sense. (Although plain text in my understanding actually refers to lack of formatting.)

The attribute “plain” probably refers to plain text in the contexts given. Once people make the mistake of writing “ASCII” when they mean “text”, further confusion will be caused by attributes like “plain”, which are indeed ambiguous.

> Reason for asking is encountering a reference to plain ASCII describing text that clearly (by presence of accented characters) would be 8-bit.

It probably means “plain text”. But it could also mean “text in an 8-bit encoding”, if the author thinks of encodings like ISO 8859-1, windows-1252, ISO 8859-2, cp-850, Mac Roman, etc., as “extended ASCII” and even drops the attribute “extended”.
It is conceivable that “plain ASCII” is even used to emphasize that the text is not in a Unicode encoding. The context is one of many situations where in attaching a document to an email, it is advisable to include an unformatted text version of the document in the body of the email. Never mind that the latter is probably in UTF-8 anyway(?) - the issue here is the terminology. The proper term for plain text is “plain text”. The word “unformatted” is often used, and might be seen as intuitively descriptive (unformatted, as opposite to text that contains formatting like bolding, colors, and different fonts), but it is risky. For one thing, plain text is often displayed “as is” with respect to line breaks and indentation, i.e. as “preformatted” (as in pre elements in HTML). Moreover, text that is not plain text need not be formatted. It could be e.g. an XML file where XML tags are used to mark up structural parts of the text, without causing or implying any specific formatting in rendering. Yucca
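The strict, 7-bit sense of "ASCII" described in this thread is easy to test for mechanically: a byte sequence is plain ASCII exactly when every byte is below 0x80. A small illustrative sketch (the helper name is mine):

```python
def is_plain_ascii(data: bytes) -> bool:
    """True only for strict 7-bit ASCII (ANSI X3.4): every byte < 0x80."""
    return all(b < 0x80 for b in data)

print(is_plain_ascii(b"plain text"))             # True

# An accented character immediately takes the text out of 7-bit ASCII.
# The byte 0xE9 is 'é' in ISO 8859-1 but something else in other
# so-called "extended ASCII" code pages, which is exactly why the
# term "8-bit ASCII" is ambiguous: only extrinsic information can
# tell you which encoding the high bytes belong to.
print(is_plain_ascii("café".encode("latin-1")))  # False
```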
Re: Terminology question re ASCII
On Mon, Oct 28, 2013 at 10:38 PM, Mark Davis ☕ <m...@macchiato.com> wrote:

> Normally the term ASCII just refers to the 7-bit form. What is sometimes called 8-bit ASCII is the same as ISO Latin 1. If you want to be completely clear, you can say 7-bit ASCII.

One of the first hits for "8-bit ASCII" on Google Books is "When the Mac came out, it supported 8-bit ASCII.", courtesy of Introduction to Digital Publishing, by David Bergsland. (He also seems to be under the delusion that MS-DOS used 7-bit ASCII.) I don't think you can assume anything about "8-bit ASCII" besides the lower bits (hopefully) being compatible with ASCII.

--
Kie ekzistas vivo, ekzistas espero.
Re: Terminology question re ASCII
"8-bit ASCII" is not so clear! The reason for that is the historic documentation of many softwares, notably for the BASIC language, or similar ones like Excel, or even more recent languages like PHP, offering functions like CHR$(number) and ASC(string) to convert a string to the numeric 8-bit "ASCII" code of its first character, or the reverse. The effective encoding of strings was in fact not specified at all and could be any 8-bit encoding used on the platform. Only in more recent versions of implementations of these languages do they specify that the encoding of their strings is now based on Unicode (most often UTF-16, so that 8-bit values now produce the same result as ISO-8859-1), but this is not enforced if a compatibility working mode was kept (e.g. in PHP, which still uses unspecified 8-bit encodings for its strings in most of its API, or in Python, which distinguishes types for 8-bit encoded strings and Unicode-encoded strings).

2013/10/29 Mark Davis ☕ <m...@macchiato.com>

> Normally the term ASCII just refers to the 7-bit form. What is sometimes called 8-bit ASCII is the same as ISO Latin 1. If you want to be completely clear, you can say 7-bit ASCII.
>
> Mark https://plus.google.com/114199149796022210033
> *— Il meglio è l’inimico del bene —*
>
> On Tue, Oct 29, 2013 at 5:12 AM, <d...@bisharat.net> wrote:
>> Quick question on terminology use concerning a legacy encoding: If one refers to plain ASCII, or plain ASCII text or ... characters, should this be taken strictly as referring to the 7-bit basic characters, or might it encompass characters that might appear in an 8-bit character set (per the so-called extended ASCII)? I've always used the term ASCII in the 7-bit, 128 character sense, and modifying it with plain seems to reinforce that sense. (Although plain text in my understanding actually refers to lack of formatting.) Reason for asking is encountering a reference to plain ASCII describing text that clearly (by presence of accented characters) would be 8-bit.
The context is one of many situations where in attaching a document to an email, it is advisable to include an unformatted text version of the document in the body of the email. Never mind that the latter is probably in UTF-8 anyway(?) - the issue here is the terminology. TIA for any feedback. Don Osborn Sent via BlackBerry by ATT
RE: Terminology question re ASCII
I would concur. When I hear “8 bit ASCII” the context is usually confusing the term with any of what we call “ANSI Code Pages” in Windows (or similar ideas on other systems). It’s also usually the prelude to a conversation asking the requestor to back up 5 or 6 steps and explain what they’re really trying to do, because something’s probably a bit confused.

-Shawn

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Philippe Verdy
Sent: Tuesday, October 29, 2013 7:49 AM
To: Mark Davis ☕
Cc: Donald Z. Osborn; unicode
Subject: Re: Terminology question re ASCII

> "8-bit ASCII" is not so clear! The reason for that is the historic documentation of many softwares, notably for the BASIC language, or similar ones like Excel, or even more recent languages like PHP, offering functions like CHR$(number) and ASC(string) to convert a string to the numeric 8-bit "ASCII" code of its first character, or the reverse. The effective encoding of strings was in fact not specified at all and could be any 8-bit encoding used on the platform. Only in more recent versions of implementations of these languages do they specify that the encoding of their strings is now based on Unicode (most often UTF-16, so that 8-bit values now produce the same result as ISO-8859-1), but this is not enforced if a compatibility working mode was kept (e.g. in PHP, which still uses unspecified 8-bit encodings for its strings in most of its API, or in Python, which distinguishes types for 8-bit encoded strings and Unicode-encoded strings).
>
> 2013/10/29 Mark Davis ☕ <m...@macchiato.com>
>> Normally the term ASCII just refers to the 7-bit form. What is sometimes called 8-bit ASCII is the same as ISO Latin 1. If you want to be completely clear, you can say 7-bit ASCII.
>> Mark https://plus.google.com/114199149796022210033
>> — Il meglio è l’inimico del bene —
>>
>> On Tue, Oct 29, 2013 at 5:12 AM, <d...@bisharat.net> wrote:
>>> Quick question on terminology use concerning a legacy encoding: If one refers to plain ASCII, or plain ASCII text or ... characters, should this be taken strictly as referring to the 7-bit basic characters, or might it encompass characters that might appear in an 8-bit character set (per the so-called extended ASCII)? I've always used the term ASCII in the 7-bit, 128 character sense, and modifying it with plain seems to reinforce that sense. (Although plain text in my understanding actually refers to lack of formatting.) Reason for asking is encountering a reference to plain ASCII describing text that clearly (by presence of accented characters) would be 8-bit. The context is one of many situations where in attaching a document to an email, it is advisable to include an unformatted text version of the document in the body of the email. Never mind that the latter is probably in UTF-8 anyway(?) - the issue here is the terminology. TIA for any feedback.
>>> Don Osborn
>>> Sent via BlackBerry by AT&T
Re: Terminology question re ASCII
2013/10/29 Shawn Steele <shawn.ste...@microsoft.com>

> I would concur. When I hear “8 bit ASCII” the context is usually confusing the term with any of what we call “ANSI Code Pages” in Windows. (or similar ideas on other systems).

Of course, not just Windows (or MS-DOS). This was seen as well in various early OSes for personal computers from various brands and various countries (not just the US, like Atari, but as well from Japan, France, Germany, the UK, Sweden, and certainly others, where neither the US-only ASCII nor "ANSI" were standard). We've also seen these documents speaking about "US-ASCII" when they actually meant an 8-bit encoding whose lower 7-bit part matched ISO 646 for the US (i.e. the real ASCII standard from ANSI).

Due to Windows however (also in IBM OS/2, IBM DOS, and other derived OSes by Digital Research for example, and also in some brands of Unix, CP/M, VMS... as well as in early development/porting for Linux), the ambiguity arose when people started to speak about "ANSI" as an encoding, when ANSI is actually a standards body developing various standards (including for other encodings). Later this was "corrected" (not in Windows, which still uses the incorrect term "ANSI codepage" when none of those code pages actually came from ANSI but from Microsoft, IBM, or some other national bodies, and were later modified by Microsoft!) by simply using "ASCII" instead of "ANSI", when they should have just spoken of **some** range of 8-bit encodings supported by the underlying OS, whose lower 7-bit part was more or less based on some national version of ISO 646 (or sometimes only on its invariant part, excluding significant parts reserved for C0 controls but tweaked to encode printable characters, e.g. in VISCII or in IBM PC codepages for DOS).

7-bit and 8-bit encodings have always been a mess to reference, with frequently ambiguous or wrong names, and many aliases being developed when trying to disambiguate them (e.g. the IBM and Microsoft numeric codepages, later aliased again on other systems!).

This led to the creation of an international registry of encoding identifiers to fix the recommended identifiers for interchange and deprecate the other aliases (but Microsoft never used it directly; it continued using its own numeric codepages, and just accepted a few named aliases, sometimes incorrectly, for example when Microsoft FrontPage confused and aliased ISO-8859-1 and windows-1252, changing them in incompatible ways, forcing HTML5 now to declare that "ISO-8859-1" is no longer that standard but windows-1252).
Re: Terminology question re ASCII
Normally the term ASCII just refers to the 7-bit form. What is sometimes called 8-bit ASCII is the same as ISO Latin 1. If you want to be completely clear, you can say 7-bit ASCII. Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Tue, Oct 29, 2013 at 5:12 AM, d...@bisharat.net wrote: Quick question on terminology use concerning a legacy encoding: If one refers to plain ASCII, or plain ASCII text or ... characters, should this be taken strictly as referring to the 7-bit basic characters, or might it encompass characters that might appear in an 8-bit character set (per the so-called extended ASCII)? I've always used the term ASCII in the 7-bit, 128 character sense, and modifying it with plain seems to reinforce that sense. (Although plain text in my understanding actually refers to lack of formatting.) Reason for asking is encountering a reference to plain ASCII describing text that clearly (by presence of accented characters) would be 8-bit. The context is one of many situations where in attaching a document to an email, it is advisable to include an unformatted text version of the document in the body of the email. Never mind that the latter is probably in UTF-8 anyway(?) - the issue here is the terminology. TIA for any feedback. Don Osborn Sent via BlackBerry by ATT
Re: UTF-8 ill-formed question
Hello,

am 2012-12-15 schrieb Philippe Verdy:

> But there's still a bug (or request for enhancement) for your Pocket converters:
> - For UTF-16 you correctly exclude the range U+D800..U+DFFF (surrogates) from the sets of convertible codepoints.
> - But you don't exclude this range in the case of your UTF-8 and UTF-32 magic encoders, which could forget this case. Of course your encoder would create distinct sequences for these code points, but they are not valid UTF-8 or valid UTF-32 encodings.

Only the UTF-16 variant is really *my* “magic pocket encoder” (MPE); the author is named on every one of the three. I would not demand more from those MPEs than converting a valid UCS character to a valid, and equivalent, UTF sequence – and to illustrate the underlying algorithm. I guess, originally, they were meant as jokes – partially, at least; I have used them as a didactic device, in my beginner's lecture on Unicode.

Clearly, Mike Ayers made the point that the UTF-32 encoding is nothing but a simple shortcut (in the terms of its two predecessors). His one-row-only MPE expresses this quite aptly, and any additional branch would spoil the impression.

The reason I excluded the surrogates from my UTF-8 MPE was really that I needed additional space for the user’s guide on the reverse side.

Cheers,
Otto Stolz
Re: UTF-8 ill-formed question
2012/12/16 Otto Stolz <otto.st...@uni-konstanz.de>

> The reason I excluded the surrogates from my UTF-8 MPE was really that I needed additional space for the user’s guide on the reverse side.

Why would adding a row on the front side not have preserved the space on the reverse side? If this is regarded as a didactic tool, adding this row would have focused more on the validity constraint of UTF-8, enforced in TUS and now as well in the IETF RFC, made by ISO to be fully compatible with TUS.

I think that the row was missing only because your MPE was initially designed for the old UTF-8 definition in the now obsolete ISO text, where the validity constraint was not clear (it was not clear as well in past variations of UTF-8 that still exist in Java — not really for plain-text interchange, but for the 8-bit-native JNI API compatible with 8-bit C strings, and as part of the serialization format of compiled Java classes).

Add this missing row, and everything on the reverse side can remain the same (or can use a less cryptic, compact description of how it works).
Re: UTF-8 ill-formed question
Hello,

2012/12/16 Otto Stolz <otto.st...@uni-konstanz.de>

>> The reason I excluded the surrogates from my UTF-8 MPE was really that I needed additional space for the user’s guide on the reverse side.

Sorry, typo; I meant: “my UTF-16 MPE”. I added that extra row (with the branch excluding the surrogates) to gain extra space on the reverse side.

Am 2012-12-16 schrieb Philippe Verdy:

> Add this missing row, Everything in the reverse side can remain the same (or can be using a less cryptic compact description of how it works).

I will certainly not change Marco Cimarosti’s original design of his UTF-8 MPE.

Best wishes,
Otto Stolz
Re: UTF-8 ill-formed question
But the old Marco design at that time (2002) was still ignoring the Unicode UTF-8 conformance constraints, as demonstrated in its use of the obsolete U-00n notation (matching the obsolete ISO/IETF definition). If the purpose of this pocket conversion card is to be used for tutorial purposes, omitting the validity constraint is not very didactic and could continue to cause compatibility troubles if these rules are not exposed and learnt, and consequently are ignored in applications.

Note that in my previous post, I dropped the extra leading zeroes in Marco's use of the obsolete U-00n notation of supplementary codepoints, but I forgot to change the U- prefix into U+ for these supplementary code points. Sorry about that.

Of course there are better ways to present this card as something that will be printed (then placed under a reusable plastic cover, like an identity card or driver's licence, at the size of a credit card for your jacket), using HTML or PDF instead of just this basic plain-text format. The usage instructions on the back side would also be clearer, and there would be additional visual hints to make it more obvious. And you would be less restricted in drawing the diagram, without using the ugly box-framing characters (only usable with monospaced fonts, which are ugly for presenting the instructions). The pocket card could also use background colors to better distinguish an all-white frame where you need to write something (better than using a dot) from what is fixed in the layout.

There are also other possible presentations, if printing a similar tool on cardboard: just use rotating wheels (1 for VW, 1 for X, 1 for Y; you may ignore the Z wheel, which would display the same value in the input and in the output window) and a front masking card with windows showing the input and the result of the conversion! You don't need any pen; it's reusable, simpler and faster to use.
2012/12/16 Doug Ewell d...@ewellic.org I remember Marco's original post in 2002. His intent was to give people with an actual U+ code point that needed converting—like James Lin ten years later—a quick way to do so without getting immersed in all the bit-shifting math. If this were a routine being run by a computer, or a tutorial on UTF-8, I would agree that it should have taken loose surrogates into account. But it's not. It's just a quick manual reference guide, and loose surrogates are 0.0001% of the real-world problem for users like James. While I note that Philippe's amended version seems straightforward and in keeping with Marco's original intent (short and simple), I'd like to suggest that neither Marco for creating the original guide, nor anyone else for doing up UTF-16 and UTF-32 versions, nor Otto for reposting them on the list this week, need to be beaten up any further over this edge case. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: UTF-8 ill-formed question
Philippe Verdy wrote: If the puprpose of this pocket conversion card is to be used for tutorial purpose, It never was. It was a quick reference guide for experienced users who already understood the caveats. Not worth arguing further. -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
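The validity constraint being debated in this thread is easy to demonstrate with any conformant encoder: well-formed UTF-8 excludes the surrogate range U+D800..U+DFFF, so an encoder must refuse a lone surrogate rather than emit the (ill-formed) three-byte pattern. A quick check in Python:

```python
# A valid scalar value encodes normally:
ok = "\u4E8C".encode("utf-8")
print(ok.hex())                  # e4ba8c

# A lone surrogate code point is not a valid scalar value,
# so encoding it to UTF-8 must fail (TUS table 3-7 excludes
# ED A0 80 .. ED BF BF from well-formed byte sequences):
try:
    "\ud800".encode("utf-8")
except UnicodeEncodeError as e:
    print("ill-formed:", e.reason)
```

This is exactly the row Philippe argued the UTF-8 pocket card should carry: without it, the card mechanically produces byte sequences for D800..DFFF that no conformant decoder will accept.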
Re: UTF-8 ill-formed question
Hello,

On 2012-12-11 20:16, James Lin wrote:
> If I have a code point, U+4E8C or 二, in UTF-8 it's E4 BA 8C, while in UTF-16 it's 4E8C. Where does this BA come from?

Cf. http://skew.org/cumped/. Enclosed are the (almost original) version of "Cima's Magic UTF-8 Pocket Encoder" (2004), and its two followers for more UTFs. Display or print with a fixed-pitch font, such as Lucida Console or Courier New. Enjoy!

Cheers, Otto Stolz

Side 1 (print and cut out): [ASCII-art work card "Cima's UTF-8 Magic Pocket Encoder", vers. 1.1, 30 June 2004, by M.C.: one row per UTF-8 sequence length (one octet for U+0000..U+007F up to four octets for the supplementary range), a dotted work area for computing each octet, and a hexadecimal-to-base-4 conversion table (0=00, 1=01, ... F=33).]

Side 2 (print, cut out, and glue on back of side 1):

Cima's UTF-8 Magic Pocket Encoder - User's Manual (vers. 1.1, 30 June 2004, by Marco Cimarosti)

- Left column: min and max Unicode scalar values: pick the row that applies to the code point you want to convert to UTF-8. Letters V..Z mark the hexadecimal digits that have to be processed.
- Right column: hexadecimal to base-4 table.
- Central columns: work area to compute each octet (1 to 4) that constitutes the UTF-8 octet sequence. Convert each digit marked by V..Z from hexadecimal to base 4. Write the base-4 digits on the dots placed under the letters v..z (two base-4 digits per hex digit). Convert each 2-digit base-4 number to a hex digit and write it on the dots on the next line. That is your UTF-8 sequence in hex! Exclamation marks show passages that may be skipped, either because the digit is hard-coded, or because it may be copied directly from the scalar value.

Enjoy! Marco

Obverse (print with a fixed-width font, such as Lucida Console, and cut out): [ASCII-art work card "Otto's Magic Pocket Encoder for UTF-16", version 1.1, (c) 2004-07-05: rows for U+0000..U+D7FF and U+E000..U+FFFF (code unit copied directly from the scalar value) and for the supplementary range (surrogate pairs), with hexadecadic-to-quaternary conversion tables.]

Reverse (cut out and paste on back of obverse):

Otto's Magic Pocket Encoder for UTF-16, Version 1.1 - User's Manual (inspired from Cima's UTF-8 MPE)

- Left column: min and max Unicode scalar values: pick the row that applies to the code point to be converted. T…Z mark the hexadecadic digits that have to be processed.
- Central column: work area to compute UTF-16BE code units.
- Right column: hexadecadic-to-quaternary conversion tables: one for T to tt, one for U/V to uu/vv (step 1) and for step 2.
1. Convert each digit marked by T…V from hex to quat. Write the quat digits on the underscores placed under the letters t…v.
2. Convert 2-digit quat numbers to hex digits, or copy digits W…Z as indicated, and write them on the underscores on the next line. That's your UTF-16BE sequence in hex.
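For illustration, the card's base-4 trick for its three-octet row (U+0800..U+FFFF) can be mechanized in a few lines of Python. The function names here are invented for this sketch and are not part of the original cards:

```python
def hex_to_quat(h):
    """One hex digit -> two base-4 ('quat') digits, as in the card's right column."""
    n = int(h, 16)
    return f"{n >> 2}{n & 3}"

def utf8_3byte_via_base4(cp):
    """Card row for U+0800..U+FFFF: the hard-coded prefix quats 32 / 2 / 2
    (i.e. bit patterns 1110, 10, 10) interleaved with the scalar value's
    own quat digits, then regrouped four quats at a time into octets."""
    assert 0x0800 <= cp <= 0xFFFF
    w, x, y, z = (hex_to_quat(d) for d in f"{cp:04X}")
    quats = "32" + w + "2" + x + y[0] + "2" + y[1] + z   # 12 quat digits = 3 octets
    return bytes(int(quats[i:i + 4], 4) for i in range(0, 12, 4))

# James's example: U+4E8C encodes to E4 BA 8C.
assert utf8_3byte_via_base4(0x4E8C) == b"\xE4\xBA\x8C"
```

The base-4 detour works because each base-4 digit is exactly two bits, so the fixed lead bits of each octet (1110 and 10) become fixed quat digits.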
Re: UTF-8 ill-formed question
On 12/11/2012 11:50 AM, vanis...@boil.afraid.org wrote:
> From: James Lin James_Lin_at_symantec.com
>> Hi, does anyone know why the ill-formed sequence occurred in UTF-8? Besides the fact that it doesn't follow the pattern of UTF-8 byte sequences, I'm just wondering how or why. If I have a code point, U+4E8C or 二, in UTF-8 it's E4 BA 8C while in UTF-16 it's 4E8C. Where does this BA come from? thanks -James
>
> Each of the UTF encodings represents the binary data in different ways, so we need to break the scalar value, U+4E8C, into its binary representation before we proceed:
>
> 4E8C = 0100 1110 1000 1100
>
> Then we need to look up the rules for UTF-8. They state that code points between U+0800 and U+FFFF are encoded with three bytes, in the form 1110xxxx 10xxxxxx 10xxxxxx. So, plugging in our data (regrouped as 0100 / 111010 / 001100), we get
>
> 1110xxxx 10xxxxxx 10xxxxxx
> = 11100100 10111010 10001100
> = E4 BA 8C
>
> -Van Anderson

Nice! A./

PS: I fixed a missing \
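Van's bit-twiddling can be checked mechanically. A minimal Python sketch (an illustration, not part of the thread) builds the three UTF-8 octets by hand and compares them with the built-in codec:

```python
# U+4E8C lies in the three-byte range U+0800..U+FFFF:
# its 16 bits 0100 1110 1000 1100 are split 4 + 6 + 6
# and dropped into the template 1110xxxx 10xxxxxx 10xxxxxx.
cp = 0x4E8C
b1 = 0xE0 | (cp >> 12)           # 1110xxxx <- top 4 bits    (0100)
b2 = 0x80 | ((cp >> 6) & 0x3F)   # 10xxxxxx <- middle 6 bits (111010)
b3 = 0x80 | (cp & 0x3F)          # 10xxxxxx <- low 6 bits    (001100)
manual = bytes([b1, b2, b3])

assert manual == b"\xE4\xBA\x8C" == "\u4E8C".encode("utf-8")
```

The BA byte is therefore not part of the scalar value itself; it is the 10-prefixed continuation byte carrying the middle six bits.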
Re: UTF-8 ill-formed question
Thank you so much, everyone, for explaining it. I got it now! -James

On 12/11/12 11:50 AM, vanis...@boil.afraid.org wrote:
> From: James Lin James_Lin_at_symantec.com
>> Hi, does anyone know why the ill-formed sequence occurred in UTF-8? Besides the fact that it doesn't follow the pattern of UTF-8 byte sequences, I'm just wondering how or why. If I have a code point, U+4E8C or 二, in UTF-8 it's E4 BA 8C while in UTF-16 it's 4E8C. Where does this BA come from? thanks -James
>
> Each of the UTF encodings represents the binary data in different ways, so we need to break the scalar value, U+4E8C, into its binary representation before we proceed:
>
> 4E8C = 0100 1110 1000 1100
>
> Then we need to look up the rules for UTF-8. They state that code points between U+0800 and U+FFFF are encoded with three bytes, in the form 1110xxxx 10xxxxxx 10xxxxxx. So, plugging in our data (regrouped as 0100 / 111010 / 001100), we get
>
> 1110xxxx 10xxxxxx 10xxxxxx
> = 11100100 10111010 10001100
> = E4 BA 8C
>
> -Van Anderson
Question about normalization tests
Hi there, I'm going through the NormalizationTests.txt in the 6.3.0d1 database, and I ran across this line: 0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062; # (a◌̅◌̕◌̀◌֮b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; a◌֮◌̅◌̀◌̕b; ) LATIN SMALL LETTER A, COMBINING OVERLINE, COMBINING COMMA ABOVE RIGHT, COMBINING GRAVE ACCENT, HEBREW ACCENT ZINOR, LATIN SMALL LETTER B The relevant parts for my question are: Source: 0061 0305 0315 0300 05AE 0062 NFD: 0061 05AE 0305 0300 0315 0062 NFC: 0061 05AE 0305 0300 0315 0062 I agree with the NFD decomposition result, but the NFC one seems wrong to me. If you look at rule D117 in the Unicode Spec http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf (I couldn't find the spec for 6.3 -- hopefully 6.2 is close enough), it gives the algorithm for NFC composition. The way I interpret it, this is how the composition proceeds: Starting with the NFD decomposition string, we retrieve the combining classes for each character from the UnicodeData.txt file: 0061 - 0 05AE - 228 0305 - 230 0300 - 230 0315 - 232 0062 - 0 You start at the first character after the starter (0061, with ccc=0), which is 05AE. There is no primary composition for the sequence 0061 05AE, so you move on. Looking at 0305, it is not blocked from 0061, so check the primary composition for 0061 0305. There is none for that either, so move on. Looking at 0300, it is also not blocked from 0061, so check the primary composition for 0061 0300. There is a primary composition for that sequence, 00E0, so replace the starter with that, delete the 0300, and continue. The string looks like this now: 00E0 - 0 05AE - 228 0305 - 230 0315 - 232 0062 - 0 Checking 0315 and 0062, they are not blocked, but there is no composition with 00E0, so the algorithm ends with the result: 00E0 05AE 0305 0315 0062 This disagrees with what it says in the normalization tests file as listed above. 
The question is, did I misunderstand the algorithm, or is this perhaps a bug in the data file? Thanks, Edwin
Re: Question about normalization tests
0300 *is* blocked, because there is a preceding character (0305) that has the same combining class (230).

Mark
https://plus.google.com/114199149796022210033
— Il meglio è l'inimico del bene —

On Mon, Dec 10, 2012 at 11:55 AM, Edwin Hoogerbeets ehoogerbe...@gmail.com wrote:
> Looking at 0300, it is also not blocked from 0061, so check the primary composition for 0061 0300. There is a primary composition for that sequence, 00E0, so replace the starter with that, delete the 0300, and continue. The string looks like this now:
RE: Question about normalization tests
Your misunderstanding is at the highlighted statement below. Actually 0300 *is* blocked from 0061 in this sequence, because it is preceded by a character with the same canonical combining class (i.e. U+0305, ccc=230). A blocking context is the preceding combining character either having ccc=0 or having ccc greater than or equal to the character being checked. --Ken Starting with the NFD decomposition string, we retrieve the combining classes for each character from the UnicodeData.txt file: 0061 - 0 05AE - 228 0305 - 230 0300 - 230 0315 - 232 0062 - 0 You start at the first character after the starter (0061, with ccc=0), which is 05AE. There is no primary composition for the sequence 0061 05AE, so you move on. Looking at 0305, it is not blocked from 0061, so check the primary composition for 0061 0305. There is none for that either, so move on. Looking at 0300, it is also not blocked from 0061, so check the primary composition for 0061 0300. There is a primary composition for that sequence, 00E0, so replace the starter with that, delete the 0300, and continue. The string looks like this now: 00E0 - 0 05AE - 228 0305 - 230 0315 - 232 0062 - 0 Checking 0315 and 0062, they are not blocked, but there is no composition with 00E0, so the algorithm ends with the result: 00E0 05AE 0305 0315 0062 This disagrees with what it says in the normalization tests file as listed above. The question is, did I misunderstand the algorithm, or is this perhaps a bug in the data file? Thanks, Edwin
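For illustration (a quick check, not part of the original exchange), the blocking behavior in this very test line can be reproduced with Python's stdlib unicodedata:

```python
import unicodedata

# Source line from NormalizationTest.txt: a, overline, comma above right,
# grave, zinor, b.
src = "\u0061\u0305\u0315\u0300\u05AE\u0062"

# Canonical reordering moves U+05AE (ccc=228) ahead of the ccc=230/232 marks.
nfd = unicodedata.normalize("NFD", src)
assert nfd == "\u0061\u05AE\u0305\u0300\u0315\u0062"

# U+0300 is blocked from the starter by U+0305 (equal ccc=230), so NFC
# composes nothing: a-grave (U+00E0) is NOT formed and NFC == NFD here.
assert unicodedata.normalize("NFC", src) == nfd
```

This matches the expected columns in the test file: the NFC column is identical to the NFD column for this line.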
Fwd: Re: Question about normalization tests
Ah yes, I did indeed miss the equal to part. I fixed up my code and now it works as expected. Thanks to Mark and Ken for your help and speedy response! Edwin On 12/10/2012 12:57 PM, Whistler, Ken wrote: Your misunderstanding is at the highlighted statement below. Actually 0300 **is** blocked from 0061 in this sequence, because it is preceded by a character with the same canonical combining class (i.e. U+0305, ccc=230). A blocking context is the preceding combining character either having ccc=0 or having ccc greater than *or equal to* the character being checked. --Ken
A question about the default grapheme cluster boundaries with U+0020 as the grapheme base
It seems there is an inconsistency between what the default grapheme cluster specification says and what the test results are expected to be. UAX #29 says:

"Another key feature [of default Unicode grapheme clusters] is that *default Unicode grapheme clusters are atomic units with respect to the process of determining the Unicode default line, word, and sentence boundaries*."

This is also mentioned in UAX #14:

"Example 6. Some implementations may wish to tailor the line breaking algorithm to resolve grapheme clusters according to Unicode Standard Annex #29, 'Unicode Text Segmentation' [UAX29], as a first stage. *Generally, the line breaking algorithm does not create line break opportunities within default grapheme clusters*; therefore such a tailoring would be expected to produce results that are close to those defined by the default algorithm. However, if such a tailoring is chosen, characters that are members of line break class CM but not part of the definition of default grapheme clusters must still be handled by rules LB9 and LB10, or by some additional tailoring."

However, the sequence U+0020 (SP), U+0308 (CM) is handled in the line breaking algorithm by rules LB10+LB18 and produces a break opportunity, while GB9 prohibits a break between U+0020 (Other) and U+0308 (Extend). Section 9.2, "Legacy Support for Space Character as Base for Combining Marks", in UAX #29 clarifies why the line break occurs, but the statements quoted above are then false and introduce some ambiguity. If the space character is no longer a grapheme base, the grapheme cluster breaking rules need to be updated.

Kind regards, Konstantin
Re: Question on U+33D7
Grandpa grandpa I wanna hear the story about the turtles *now*! :-) Sent from my Android phone
Re: Question on U+33D7
On Fri, Feb 24, 2012 at 5:18 AM, Shriramana Sharma samj...@gmail.com wrote:
> Grandpa, grandpa, I wanna hear the story about the turtles *now*! :-) Sent from my Android phone

Thanks, all, for the enlightening replies. My intent was sorting using the UCA, but it really did not matter much, because U+33D7 sorts after PH in either case (0050 0048 or 0070 0048). I was curious why U+33D7 was defined and stayed that way in Unicode, and that was answered more than comprehensively. Regards, Matt
Question on U+33D7
It is defined as

33D7;SQUARE PH;So;0;L;&lt;square&gt; 0050 0048;;;;N;SQUARED PH;;;;

in UnicodeData.txt, but it is shown as pH in the code chart. Should it be 0070 0048 (pH) or PH? Thanks, Matt
Re: Question on U+33D7
On 2012/2/23 Matt Ma matt.ma.um...@gmail.com wrote:
> It is defined as 33D7;SQUARE PH;So;0;L;&lt;square&gt; 0050 0048;;;;N;SQUARED PH;;;; in UnicodeData.txt, but it is shown as pH in the code chart. Should it be 0070 0048 (pH) or PH?

It should certainly be pH, i.e., &lt;square&gt; 0070 0048, because that's the peculiar casing in widespread (universal, really) use for this basic chemistry concept (AFAIK it means "power of Hydrogen"). See http://en.wikipedia.org/wiki/pH#History . While there's no surprise at the Unicode name being all caps, I'm surprised that the decomposition mapping is wrongly set to 0050 0048 instead of to 0070 0048.

--
António MARTINS-Tuválkin | tuval...@gmail.com | PT-1500-111 LISBOA | +351 934 821 700, +351 212 463 477 | facebook.com/profile.php?id=744658416
"Não me invejo de quem tem / carros, parelhas e montes / só me invejo de quem bebe / a água em todas as fontes" - De sable uma fonte e bordadura escaqueada de jalde e goles, por timbre a bandeira, por mote o 1º verso acima, e por grito de guerra "Mi rajtas!".
Re: Question on U+33D7
On 2/23/2012 2:44 PM, António Martins-Tuválkin wrote:
> On 2012/2/23 Matt Ma matt.ma.um...@gmail.com wrote:
>> It is defined as 33D7;SQUARE PH;So;0;L;&lt;square&gt; 0050 0048;;;;N;SQUARED PH;;;; in UnicodeData.txt, but it is shown as pH in the code chart. Should it be 0070 0048 (pH) or PH?
> It should certainly be pH, i.e., &lt;square&gt; 0070 0048, because that's the peculiar casing in widespread (universal, really) use for this basic chemistry concept (AFAIK it means "power of Hydrogen"). See http://en.wikipedia.org/wiki/pH#History . While there's no surprise at the Unicode name being all caps, I'm surprised that the decomposition mapping is wrongly set to 0050 0048 instead of to 0070 0048.

The early fonts and code tables showed this in all caps. Unfortunately, mappings are frozen -- including mistakes. This is one of the many reasons not to use NFKD or NFKC for transforming data: these transformations should be limited to dealing with identifiers, where practically all of the problematic characters are already disallowed. If your intent is to sort or search a document using fuzzy equivalences, then you are not required to limit yourself to the NFKC/NFKD transformations in any way, because you would not be claiming to be normalizing the text in the sense of a Unicode Normalization Form. A./
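For illustration (a quick check, not part of this exchange), Python's unicodedata exposes both the frozen mapping and its NFKD consequence:

```python
import unicodedata

# The decomposition mapping recorded in UnicodeData.txt is frozen forever:
assert unicodedata.decomposition("\u33D7") == "<square> 0050 0048"

# ...so the compatibility normalization forms yield capital "PH",
# not the chemically conventional "pH".
assert unicodedata.normalize("NFKD", "\u33D7") == "PH"
assert unicodedata.normalize("NFKC", "\u33D7") == "PH"
```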
Re: Question on U+33D7
On 2/23/2012 2:44 PM, António Martins-Tuválkin wrote:
>> It is defined as 33D7;SQUARE PH;So;0;L;&lt;square&gt; 0050 0048;;;;N;SQUARED PH;;;; in UnicodeData.txt, but it is shown as pH in the code chart. Should it be 0070 0048 (pH) or PH?
> It should certainly be pH, i.e., &lt;square&gt; 0070 0048, because that's the peculiar casing in widespread (universal, really) use for this basic chemistry concept (AFAIK it means "power of Hydrogen"). See http://en.wikipedia.org/wiki/pH#History . While there's no surprise at the Unicode name being all caps, I'm surprised that the decomposition mapping is wrongly set to 0050 0048 instead of to 0070 0048.

O.k., folks, I guess it's time for everybody to gather around the fire for another episode of Every Character Has a Story.

First, to answer Matt Ma's original question: no, the decomposition should *not* be &lt;square&gt; 0070 0048. The reason for that is simple: no matter what the glyph looks like, or what people think the character might mean, the decomposition mapping is immutable -- constrained by the stability guarantees for Unicode normalization. U+33D7 had that decomposition mapping as of Unicode 3.1, which defines the base for normalization stability, so right or wrong, come hell or high water, it stays that way forever.

But that begs the question of how it got to be that way in the first place. To answer that, we have to dig deeper into the history of the encoding. If you will now pull down your copies of Unicode 1.0 off the shelf and turn to p. 362, you will see that U+33D7 was included in Unicode 1.0. Lo and behold, the glyph shown in the charts for U+33D7 is PH, with a capital P, rather than a lowercase p. (The character was also named SQUARED PH, rather than the current SQUARE PH, but the explanation for that will have to wait for another evening.) Unicode 1.0 didn't have any formal decompositions, but Unicode 1.1 did. In Unicode 1.1, on p. 75, the decomposition for U+33D7 is given as [0050] [0048], reflecting the glyph shown for the character in Unicode 1.0.

It was Unicode 2.0 which changed the glyph for U+33D7 to pH, on the assumption that the character must have been intended as an East Asian square symbol representation of the chemical symbol pH. The decomposition for U+33D7 was not adjusted, however, although its format was shifted to &lt;square&gt; + 0050 P + 0048 H in the charts.

Now, tracking down the details of the decision process that was involved in changing the glyph for U+33D7 for Unicode 2.0 is pretty difficult. The development of the suite of fonts for printing Unicode 2.0 was a pretty wild and woolly process, as that was the first attempt to print the entire set of charts with outline fonts. Unicode 1.0 had been printed with a bitmap font developed at Xerox in the early, early days. Some of the glyph changes between Unicode 1.0 and 2.0 just happened, despite the care which was taken to try to check everything. I'm pretty sure that the glyph change for U+33D7 was discussed by the editors at some point (in either late 1995 or very early 1996), but at that stage in the development of the standard that kind of thing was usually not recorded on an item-by-item basis. Remember, there was a *lot* going on then which was much more important to the UTC than the glyph for some East Asian compatibility character that nobody used: the design of UTF-8, for example!

Speaking of use of the character, where *did* it come from exactly, and what was it intended for? Well, that is also problematical. *Most* of the characters in the CJK Compatibility block in the range U+3380..U+33DD can easily be traced to KS X 1001:1992 (then known as KS C 5601) or CNS 11643. But U+33D7, U+33DA, and U+33DB are anomalous. They didn't have any mappings (that I knew about) as of Unicode 1.0. They may have come from some early draft of a Korean standard, or from some Asian company's private registry of character extensions, or maybe just from a paper copy of character stuff sitting around at Xerox circa 1989. Nobody really seemed to be sure what they were -- they were just more ill-advised East Asian squared abbreviation dreck that was added to the pile and not examined very carefully, because everybody knew that such symbols for SI units (and other scientific and math symbols of their ilk, such as ln for natural logarithm) should just be spelled out with regular characters.

We can presume, in hindsight, that U+33D7 *may* have been originally intended as an East Asian character set abbreviation symbol for the chemical concept of pH. U+33D9 was presumably intended for parts per million, although I don't recall that anybody has actually bothered to think about that, and if they had, they might have suggested that the glyph for *that* symbol also be changed, to the more usual lowercase ppm. And U+33DA PR? Who knows? My guess would be an abbreviation for per radian, as in 57.2957 degrees per radian, but your guess is as good as mine. I suppose it could have
Re: Question on UCA collation parameters (strength = tertiary, alternate = shifted)
In addition, the default settings in Table 14 of UTS #10 (6.0.0) are: strength: tertiary; alternate: shifted. But those settings won't generate the conformant behavior specified by CollationTest_SHIFTED.txt. I think that when alternate is set to shifted, strength should default to quaternary unless it is explicitly set. Thanks, Matt

On Tue, Nov 29, 2011 at 12:55 PM, Matt Ma matt.ma.um...@gmail.com wrote:
> Thanks for the clarification. But to pass the UCA conformance test on Shifted, does the strength have to be set to quaternary? However, it is stated in UCA, C2: "A conformant implementation shall support at least three levels of collation." Does this mean a UCA-conformant implementation only needs to pass the UCA conformance test on Non-Ignorable? Regards, Matt

On Tue, Nov 29, 2011 at 12:49 PM, Mark Davis ☕ m...@macchiato.com wrote:
> Yes, if the strength is tertiary, then Blanked and Shifted give the same results. http://www.unicode.org/reports/tr10/proposed.html#Variable_Weighting
> Mark — Il meglio è l'inimico del bene — [https://plus.google.com/114199149796022210033]

On Tue, Nov 29, 2011 at 19:11, Matt Ma matt.ma.um...@gmail.com wrote:
> Hi, does Shifted imply strength being quaternary? If strength stays as tertiary (default or explicitly set), it seems the collation behavior is Blanked. Please clarify. Thanks, Matt
Question on UCA collation parameters (strength = tertiary, alternate = shifted)
Hi, does Shifted imply strength being quaternary? If strength stays as tertiary (default or explicitly set), it seems the collation behavior is Blanked. Please clarify. Thanks, Matt
Re: Question on UCA collation parameters (strength = tertiary, alternate = shifted)
Thanks for the clarification. But to pass the UCA conformance test on Shifted, does the strength have to be set to quaternary? However, it is stated in UCA, C2: "A conformant implementation shall support at least three levels of collation." Does this mean a UCA-conformant implementation only needs to pass the UCA conformance test on Non-Ignorable? Regards, Matt

On Tue, Nov 29, 2011 at 12:49 PM, Mark Davis ☕ m...@macchiato.com wrote:
> Yes, if the strength is tertiary, then Blanked and Shifted give the same results. http://www.unicode.org/reports/tr10/proposed.html#Variable_Weighting
> Mark — Il meglio è l'inimico del bene — [https://plus.google.com/114199149796022210033]

On Tue, Nov 29, 2011 at 19:11, Matt Ma matt.ma.um...@gmail.com wrote:
> Hi, does Shifted imply strength being quaternary? If strength stays as tertiary (default or explicitly set), it seems the collation behavior is Blanked. Please clarify. Thanks, Matt
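The difference the fourth level makes can be seen in a toy model. The weights below are invented for illustration (they are not DUCET values, and this is not ICU); the sketch only mimics UTS #10's variable-weighting idea:

```python
# Toy UCA-style sort keys: under alternate=shifted, variable collation
# elements (here just '-') contribute only a quaternary weight equal to
# their primary; everything else gets a high quaternary (0xFFFF).
WEIGHTS = {                    # hypothetical (primary, secondary, tertiary)
    "a": (0x29, 0x05, 0x05),
    "b": (0x2A, 0x05, 0x05),
    "-": (0x0209, 0x05, 0x05), # a "variable" element in this toy
}
VARIABLE = {"-"}

def sort_key(s, strength):
    levels = [[], [], [], []]  # primary .. quaternary
    for ch in s:
        p, sec, t = WEIGHTS[ch]
        if ch in VARIABLE:                 # shifted: drop levels 1-3,
            levels[3].append(p)            # keep primary at level 4
        else:
            levels[0].append(p); levels[1].append(sec); levels[2].append(t)
            levels[3].append(0xFFFF)
    return tuple(tuple(lvl) for lvl in levels[:strength])

# With strength=3 the quaternary level is truncated away, so the variable
# '-' vanishes entirely -- indistinguishable from Blanked:
assert sort_key("a-b", 3) == sort_key("ab", 3)
# Only at strength=4 does Shifted actually distinguish the strings:
assert sort_key("a-b", 4) != sort_key("ab", 4)
```

This is why the conformance data in CollationTest_SHIFTED.txt can only be matched when the quaternary level is retained.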
RE: Pupil's question about Burmese
FWIW: The OS really likes Unicode, so lots of the text input, etc., really is Unicode. ANSI apps (including non-Unicode web pages) get the data back from those controls in ANSI, so you can lose data that it looked like you had entered. As mentioned, the solution is to fix the app to use Unicode, especially for a language like this. In these cases, machines will be fairly inconsistent even if they did support some code page, but Unicode works most everywhere. Usually it's not difficult for a web page to switch to UTF-8. If it's a form, it's even possible that overriding it on your end might get the data posted back in UTF-8 and succeed (if you're really lucky), but the real fix is to have the web server serve Unicode. -Shawn http://blogs.msdn.com/shawnste

From: unicode-bou...@unicode.org [unicode-bou...@unicode.org] on behalf of Peter Constable [peter...@microsoft.com]
Sent: Tuesday, November 09, 2010 10:42 PM
To: James Lin; Ed
Cc: Unicode Mailing List
Subject: RE: Pupil's question about Burmese

A non-Unicode web page is like a non-Unicode app. Web pages, and apps, should use Unicode. Peter

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of James Lin
Sent: Tuesday, November 09, 2010 11:24 AM
To: Ed
Cc: Unicode Mailing List
Subject: RE: Pupil's question about Burmese

Oh, don't get me wrong. Having Unicode is like wearing a crown and being a king; it's the best thing out there. What I am referring to is: if a web page does not support Unicode, or for any application that does not support Unicode, even running Windows 7 with an English locale (even though natively it supports UTF-16), it is not possible to directly copy/paste without the correct supported locale; if not, you may damage the bytes of the characters, which shows up as corruption. Even though most modern APIs are (hopefully) written with Unicode calls, not all (legacy) applications are written in Unicode, so conversion is still necessary even to handle the non-ASCII data. Let me know if I am still missing something here.

-Original Message-
From: Ed [mailto:ed.tra...@gmail.com]
Sent: Tuesday, November 09, 2010 11:02 AM
To: James Lin
Cc: Unicode Mailing List
Subject: Re: Pupil's question about Burmese

> Yes, displaying is fine, but the original question is copying and pasting; without the correct locale settings, you can't copy/paste without corrupting the byte sizes. Copy/paste is generally handled by the OS itself, not the application. Even if you have a Unicode-supporting application, you can display, but you can't handle non-ASCII characters.

Why not? Modern Win32 OSes use UTF-16. Presumably most modern applications are written using calls to the modern API, which should seamlessly support copy-and-paste of Unicode text, regardless of script or language -- so long as the script or language is supported at the level of displaying the text correctly and you have a font that works for that script. Actually, even if the text displays imperfectly (i.e., one sees square boxes when lacking a proper font, or even if the OpenType GPOS and GSUB rules are not correct for a Complex Text Layout script like Burmese), copy-and-paste of the raw Unicode text should still work correctly. Is this not the case?
Re: Pupil's question about Burmese
On 11/10/2010 02:17 PM, Shawn Steele wrote:
> As mentioned, the solution is to fix the app to use Unicode. Especially for a language like this. In these cases, machines will be fairly inconsistent even if they did support some code page, but Unicode works most everywhere.

AFAIK there has never been a standard code page for Myanmar text; Unicode was the first time storage of Burmese text was standardised for computers. There are several different legacy font families in use for Myanmar, each with its own slightly different mapping to Latin code points. The font in question has a Unicode cmap table, but the map is from Latin code points to glyphs, not from Myanmar code points to glyphs. There are also several fonts which map incorrectly from the Myanmar Unicode block, using the Mon, Shan and Karen code points for glyph variants so the font can avoid having OpenType/Graphite/AAT rules. If anyone is having trouble installing genuine Myanmar Unicode fonts, I have some instructions at http://www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/gettingStarted.php

Keith
Re: Pupil's question about Burmese
Dear Peter Constable,

"Burmese _is_ supported in Windows" would make things worse than ever, creating another story like the pseudo-Unicode Zawgyi in Windows too. We are in deadlock because Microsoft has not released the Myanmar OpenType specification for Burmese, so we can't implement Burmese in OpenType-based rendering engines like Pango and HarfBuzz. We are not satisfied with just typing and printing Burmese text; we want to make effective use of Unicode data in Burmese language processing, such as spell checking, machine translation, and OCR. So, do we need a system locale for Burmese? How about a CultureInfo for the Microsoft .NET Framework? I've encouraged the use of Unicode standards among Myanmar users. Myanmar users are willing to use Unicode standards in their work, personally and in every application. But there are no advantages in using the Unicode standards and CLDR either. If Unicode.org makes standards and those standards are not applied in software and systems, how can we trust those standards? Myanmar users will not wait on Microsoft, Apple, or Oracle implementations; they are going for wrong or breakthrough solutions. Again, I have to urge caution about ethnic languages: we should take care of the Mon, Shan, and Karen languages, which were encoded in Unicode 5.1, but Microsoft hasn't yet assigned those languages in Windows 7. I've been trying to get a Burmese language pack for Microsoft Windows since 2002; I gave up and will not try any more. Microsoft is not waiting on stable standards, political and/or technical, and I don't see any reason for delaying our beloved language. Thanks for reading this, and please support a language spoken by 40 million people. We made a petition to Microsoft at http://petition.myanmarlanguage.org/ http://my.wiktionary.org is a good dictionary site; it has been started but is not yet finished.
Best Ngwe Tun On Tue, Nov 9, 2010 at 8:52 AM, Peter Constable peter...@microsoft.comwrote: From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Andrew Cunningham Your system locale has to handle the Burmese language. So you need to either install Windows 7 in Burmese or change under Regional / Language options in Control panel, under Adv tab. well considering Burmese is a language that is not supported by Microsoft ... the above is relatively irrelevant. At whatever point Burmese _is_ supported in Windows, system locale will not be relevant. To be clear, the legacy Windows notion of system locale is relevant only in relation to apps that support only legacy Windows encodings, not Unicode. There is no system locale support for languages such as Hindi or Armenian or Khmer, but that does not prevent display of text in those scripts in Unicode-capable applications. So, for instance, every copy of Windows 2000 or later versions is capable of displaying Hindi or Armenian text, regardless of the system locale setting; every copy of Windows Vista or later is capable of displaying, in addition, text in scripts such as Khmer and Ethiopic; and every copy of Windows 7 is, additionally, able to display text in scripts Tifinagh and Tai Le. In all these cases, the system locale setting has no bearing. Peter
Re: Pupil's question about Burmese
Dear Ngwe Tun, The forthcoming ICU 4.6 will include a Burmese locale (using CLDR data), with support for Burmese collation. http://site.icu-project.org/ Best regards, Peter Edberg On Nov 9, 2010, at 2:05 AM, Ngwe Tun wrote: ... We are in dead-lock because without releasing Myanmar Opentype specifiction for burmese by Microsoft. We can't implement burmese in opentype adopted rendering engine like pango and harfbuzz. We are not satisify just typing burmese text and printing burmese text. We want to have effective use of unicode data in burmese language processing like spelling check, machine translation and OCR. ... I've encouraged to use Unicode standards among Myanmar Users. Myanmar Users willing to use unicode standards in their works, personal and every application. But there are no advantages in using Unicode Standards and CLDR too. If Unicode.org make standards and do not apply those standards in software and systems, how can we trust those standards. Myanmar Users do not wait on Microsoft, Apple, Oracle implementation. They are going wrong or breakthrough solution.
Re: Pupil's question about Burmese
> So, for instance, every copy of Windows 2000 or later versions is capable of displaying Hindi or Armenian text, regardless of the system locale setting; every copy of Windows Vista or later is capable of displaying, in addition, text in scripts such as Khmer and Ethiopic; and every copy of Windows 7 is, additionally, able to display text in scripts Tifinagh and Tai Le. In all these cases, the system locale setting has no bearing.

Yes, displaying is fine, but the original question is copying and pasting; without the correct locale settings, you can't copy/paste without corrupting the byte sizes. Copy/paste is generally handled by the OS itself, not the application. Even if you have a Unicode-supporting application, you can display, but you can't handle non-ASCII characters.

On 11/8/10 6:22 PM, Peter Constable peter...@microsoft.com wrote:
> From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of Andrew Cunningham
>> Your system locale has to handle the Burmese language. So you need to either install Windows 7 in Burmese or change it under Regional/Language options in Control Panel, under the Advanced tab.
> well, considering Burmese is a language that is not supported by Microsoft ... the above is relatively irrelevant. At whatever point Burmese _is_ supported in Windows, system locale will not be relevant.
To be clear, the legacy Windows notion of system locale is relevant only in relation to apps that support only legacy Windows encodings, not Unicode. There is no system locale support for languages such as Hindi or Armenian or Khmer, but that does not prevent display of text in those scripts in Unicode-capable applications.
So, for instance, every copy of Windows 2000 or later versions is capable of displaying Hindi or Armenian text, regardless of the system locale setting; every copy of Windows Vista or later is capable of displaying, in addition, text in scripts such as Khmer and Ethiopic; and every copy of Windows 7 is, additionally, able to display text in scripts Tifinagh and Tai Le. In all these cases, the system locale setting has no bearing. Peter
Re: Pupil's question about Burmese
> Yes, displaying is fine, but the original question is copying and pasting; without the correct locale settings, you can't copy/paste without corrupting the byte sizes. Copy/paste is generally handled by the OS itself, not the application. Even if you have a Unicode-supporting application, you can display, but you can't handle non-ASCII characters.

Why not? Modern Win32 OSes use UTF-16. Presumably most modern applications are written using calls to the modern API, which should seamlessly support copy-and-paste of Unicode text, regardless of script or language -- so long as the script or language is supported at the level of displaying the text correctly and you have a font that works for that script. Actually, even if the text displays imperfectly (i.e., one sees square boxes when lacking a proper font, or even if the OpenType GPOS and GSUB rules are not correct for a Complex Text Layout script like Burmese), copy-and-paste of the raw Unicode text should still work correctly. Is this not the case?
RE: Pupil's question about Burmese
Oh, don't get me wrong. Having Unicode is like wearing a crown and being a king; it's the best thing out there. What I am referring to is: if a web page does not support Unicode, or for any application that does not support Unicode, even running Windows 7 with an English locale (even though natively it supports UTF-16), it is not possible to directly copy/paste without the correct supported locale; if not, you may damage the bytes of the characters, which shows up as corruption. Even though most modern APIs are (hopefully) written with Unicode calls, not all (legacy) applications are written in Unicode, so conversion is still necessary even to handle the non-ASCII data. Let me know if I am still missing something here.

-Original Message-
From: Ed [mailto:ed.tra...@gmail.com]
Sent: Tuesday, November 09, 2010 11:02 AM
To: James Lin
Cc: Unicode Mailing List
Subject: Re: Pupil's question about Burmese

> Yes, displaying is fine, but the original question is copying and pasting; without the correct locale settings, you can't copy/paste without corrupting the byte sizes. Copy/paste is generally handled by the OS itself, not the application. Even if you have a Unicode-supporting application, you can display, but you can't handle non-ASCII characters.

Why not? Modern Win32 OSes use UTF-16. Presumably most modern applications are written using calls to the modern API, which should seamlessly support copy-and-paste of Unicode text, regardless of script or language -- so long as the script or language is supported at the level of displaying the text correctly and you have a font that works for that script. Actually, even if the text displays imperfectly (i.e., one sees square boxes when lacking a proper font, or even if the OpenType GPOS and GSUB rules are not correct for a Complex Text Layout script like Burmese), copy-and-paste of the raw Unicode text should still work correctly. Is this not the case?