Re: Another take on the English Apostrophe in Unicode
And, Marcel, while you are at it, this is getting tiresome. Please find some other place to vent about events you know very little about; the internet is full of them. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Tue, Jun 16, 2015 at 7:33 PM, Doug Ewell d...@ewellic.org wrote: Marcel Schneider charupdate at orange dot fr wrote: "That's to despise people, that's to spit at their face." You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you, really. Not Unicode, not Microsoft, not ISO. I do wish we could put an end to all the accusations of malfeasance. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Another take on the English apostrophe in Unicode
On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider charupd...@orange.fr wrote: When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed. Linguistics, however, delivered the foundation on which Unicode issued its first recommendation on what character to use for apostrophe. The result was neither a matter of opinion, nor of probabilities. Actually, the choice is between perpetuating confusion in word processing, and getting people confused for a little while when announcing that U+2019 for apostrophe was a mistake. Quite nice of you to inform me of the core mission of Unicode—I must have somehow missed that. More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, e.g., an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIAERESIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits. In practice, whenever characters are essentially identical—and by that I mean that the overlap between the acceptable glyphs for each character is very high—people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes. So we only separated essentially identical characters in limited cases: such as letters from different scripts. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
Re: Another take on the English apostrophe in Unicode
On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable peter...@microsoft.com wrote: When it comes to orthography, the notion of what comprises the words of a language is generally pure convention. That’s because there isn’t any single *linguistic* definition of “word” that gives the same answer when phonological vs. morphological or syntactic criteria are applied. There are book-length works on just this topic, such as this: In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for supercali...docious. (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.) Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
Re: free download of ISO/IEC 10646 (was: Accessing the WG2 document register)
I think the whole thread got overheated, and Andrew was just responding to other heated comments. So it might be time to let this thread cool off a bit. The collaboration over the years between the Unicode Consortium and ISO has been, on the whole, a remarkable success. There have been frictions—as in any human enterprise—but the parties have worked to smooth those over, and to operate in good faith to incorporate the characters that are important to each side. The rising bureaucracy on the ISO side has made progress and collaboration increasingly difficult, but that did not originate with the SC2 or WG2 participants, who are often just as frustrated by it.
Re: http://✈.ws
Whoops, sent too soon. A surprise: http://✈.ws Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Fri, Jun 5, 2015 at 4:47 PM, Mark Davis ☕️ m...@macchiato.com wrote:
http://✈.ws
Re: The Oral History Of The Poop Emoji
One of many on http://unicode.org/press/emoji.html Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Mon, Jun 1, 2015 at 8:23 PM, Karl Williamson pub...@khwilliamson.com wrote: https://www.fastcompany.com/3037803/the-oral-history-of-the-poop-emoji-or-how-google-brought-poop-to-america
Re: FYI: The world’s languages, in 7 maps and charts
Hmmm. How accurate can it be? They forgot Austria, and got Switzerland wrong by almost a power of 10. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye moy...@gmail.com wrote: The South China Morning Post published a similar infographic: A world of languages - and how many speak them http://www.scmp.com/infographics/article/1810040/infographic-world-languages
Re: FYI: The world's languages, in 7 maps and charts
I think it gives a misleading picture to only include mother-language speakers, rather than all speakers (at a reasonable level of fluency). Every Swiss German is fluent in High German. Part of the problem is that it is very hard to get good data on the multiple languages that people speak—a huge number of people are fluent in more than one—and on the level of fluency in each. That alone makes it difficult to do accurate representations. That level of accuracy may not be necessary to get a general picture, but when the map purports to go into great detail... Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Wed, May 27, 2015 at 4:59 PM, Denis Jacquerye moy...@gmail.com wrote: The data used to build the infographic comes from Ethnologue.com. http://www.ethnologue.com/language/deu does not indicate the Standard German L1 population in Austria and gives a population of 727 000 Standard German L1 speakers in Switzerland (the difference is counted as Swiss German L1 speakers). On Wed, 27 May 2015 at 11:22 Mark Davis ☕️ m...@macchiato.com wrote: Hmmm. How accurate can it be? They forgot Austria, and got Switzerland wrong by almost a power of 10. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye moy...@gmail.com wrote: The South China Morning Post published a similar infographic: A world of languages - and how many speak them http://www.scmp.com/infographics/article/1810040/infographic-world-languages
Re: Tag characters
A few notes. A more concrete proposal will be in a PRI to be issued soon, and people will have a chance to comment more then. (I'm not trying to discourage discussion, just pointing out that there will be something more concrete relatively soon to comment on—people are pretty busy getting 8.0 out the door right now.) The principal reason for 3-digit codes is that it is the mechanism used by BCP47 in case ISO screws up codes (as they did for CS). The syntax does not need to follow the 3166 syntax - the codes correspond but are not the same anyway. So we didn't see the necessity for the hyphen, syntactically. There is a difference between EU and UN; the former is in BCP47. That being said, we could look at making the exceptionally reserved codes valid for this purpose (or at least the UN code). It appears that there are only 3 exceptionally reserved codes that aren't in BCP47: EZ, UK, UN. Just because a code is valid doesn't mean that there is a flag associated with it. Just like the fact that you can have the BCP47 code ja-Ahom-AQ doesn't mean that it denotes anything useful. I'd expect vendors not to waste time with non-existent flags. However, we could also discuss having a mechanism in CLDR to help provide guidelines as to which subdivisions are suitable as flags. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Sat, May 16, 2015 at 10:07 AM, Doug Ewell d...@ewellic.org wrote: L2/15-145R says: On some platforms that support a number of emoji flags, there is substantial demand to support additional flags for the following: [...] Certain supra-national regions, such as Europe (European Union flag) or the world (e.g. United Nations flag). These can be represented using UN M49 3-digit codes, for example 150 for Europe or 001 for World. These are uncomfortable equivalence classes. Not all countries in Europe are members of the European Union, and the concept of United Nations is not really the same by definition as all countries in the world. The remaining UN M.49 code elements that don't have a 3166-1 equivalent seem wholly unsuited for this mechanism (and those that do, don't need it). There are no flags for Middle Africa or Latin America and the Caribbean or Landlocked developing countries. Some trans-national organizations might _almost_ seem as if they could be shoehorned into an M.49 code element, like identifying 035 South-Eastern Asia with the ASEAN flag, but this would be problematic for the same reasons as 150 and 001. Among the ISO 3166-1 exceptionally reserved code elements are EU for European Union and UN for United Nations. If these flags are the use cases, why not simply use those alpha-2 code elements, instead of burdening the new mechanism with the 3-digit syntax? -- Doug Ewell | http://ewellic.org | Thornton, CO
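For concreteness: the existing two-letter mechanism that both messages refer to maps an ISO 3166-1 alpha-2 code onto a pair of regional indicator symbols (U+1F1E6–U+1F1FF). A minimal Java sketch of that mapping (the class and method names are made up for illustration; whether a platform actually renders 🇪🇺 or 🇺🇳 as a flag image is entirely up to the vendor):

public class RegionalIndicators {
    // Map an ISO 3166-1 alpha-2 code to a pair of regional indicator symbols.
    // REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6; the rest follow alphabetically.
    static String flag(String alpha2) {
        StringBuilder sb = new StringBuilder();
        for (char c : alpha2.toUpperCase().toCharArray()) {
            sb.appendCodePoint(0x1F1E6 + (c - 'A'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(flag("EU")); // 🇪🇺
        System.out.println(flag("UN")); // 🇺🇳
    }
}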
Re: Tag characters
The consortium is in no position to enhance protocols *itself* for exchanging images. That's firmly in other groups' hands. We can try to noodge them a bit, but what *will* make a difference is when the *vendors* of sticker solutions put pressure on the different groups responsible for the protocols to provide interoperability for images. Because there is a lot of growth in sticker solutions, I would expect there to be more such pressure. And even so, I expect it will take those some time to be deployed. We've said what our longer-term position is, and I think we all pretty much agree with that; exchanging images is much more flexible. However, we do have strong short-term pressure to show that we are responsive and responsible in adding emoji. And our adding a reasonable number of emoji per year is not going to stop Line or Skype from adding stickers! There are a few possible scenarios, and it's hard to predict the results. It could be that emoji are largely supplanted by stickers in 5 years; could be 10; could be that they both coexist indefinitely. I have no crystal ball, and neither does anyone else... Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, May 14, 2015 at 7:44 PM, Peter Constable peter...@microsoft.com wrote: And yet UTC devotes lots of effort (with an entire subcommittee) to encode more emoji as characters, but no effort toward any preferred longer-term solution not based on characters. Peter *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Shervin Afshar *Sent:* Thursday, May 14, 2015 2:27 PM *To:* wjgo_10...@btinternet.com *Cc:* unicode@unicode.org *Subject:* Re: Tag characters Thinking about this further, could the technique be used to solve the requirements of section 8 Longer Term Solutions? IMO, the industry-preferred longer-term solution (which is also discussed in that section with a few existing examples) for emoji is not going to be based on characters. ↪ Shervin On Thu, May 14, 2015 at 1:40 PM, William_J_G Overington wjgo_10...@btinternet.com wrote: What else would be possible if the same sort of technique were applied to another base character? Thinking about this further, could the technique be used to solve the requirements of section 8 Longer Term Solutions of http://www.unicode.org/reports/tr51/tr51-2.html ? Both colour pixel map and colour OpenType vector font solutions would be possible. Colour voxel map and colour vector 3d solids solutions are worth thinking about too as fun coding thought experiments that could possibly lead to useful practical results. William Overington 14 May 2015
FYI: The world’s languages, in 7 maps and charts
http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/
Re: Script / font support in Windows 10
Thanks! Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Fri, May 8, 2015 at 7:15 AM, Peter Constable peter...@microsoft.com wrote: I think this is the right public link: https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx *From:* Peter Constable *Sent:* Thursday, May 7, 2015 10:29 PM *To:* Peter Constable; unicode@unicode.org *Subject:* RE: Script / font support in Windows 10 Oops… my bad: maybe it isn’t on live servers yet. It will be soon. I’ll update with the public link when it is. *From:* Unicode [mailto:unicode-boun...@unicode.org unicode-boun...@unicode.org] *On Behalf Of *Peter Constable *Sent:* Thursday, May 7, 2015 10:15 PM *To:* unicode@unicode.org *Subject:* Script / font support in Windows 10 This page on MSDN that provides an overview of Windows support for different scripts has now been updated for Windows 10: https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099 Peter
Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?
The simplest approach would be to use ICU in a little program that scans the file. For example, you could write a little Java program that would scan the file, and turn any sequence of (\uXXXX)+ into a String, then test that string with: static final UnicodeSet OK = new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze(); ... // inside the scanning function boolean isOk = OK.containsAll(slashUString); It is key that it has to grab the entire sequence of \uXXXX escapes in a row; otherwise it will get the wrong answer. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, May 7, 2015 at 10:49 AM, Doug Ewell d...@ewellic.org wrote: Costello, Roger L. Costello at mitre dot org wrote: Are there tools to scan a JSON document to detect the presence of \uXXXX, where XXXX does not correspond to any Unicode character? A tool like this would need to scan the Unicode Character Database, for some given version, to determine which code points have been allocated to a coded character in that version and which have not. -- Doug Ewell | http://ewellic.org | Thornton, CO
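A fuller sketch of that approach, assuming ICU4J is on the classpath (the class name, the regex, and the sample input are illustrative, not part of any existing tool). It grabs maximal runs of \uXXXX escapes so that surrogate pairs are checked as one code point:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.ibm.icu.text.UnicodeSet;

public class JsonEscapeScanner {
    // Code points that are OK: everything except unassigned (gc=Cn) and surrogates.
    static final UnicodeSet OK =
        new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();

    // Match maximal runs of \uXXXX escapes, so lead+trail pairs stay together.
    static final Pattern ESCAPES = Pattern.compile("(?:\\\\u[0-9A-Fa-f]{4})+");

    public static void main(String[] args) {
        // \uD808\uDF45 decodes to U+12345 (assigned); \uFFFF is a noncharacter, gc=Cn.
        String json = "{\"a\": \"\\uD808\\uDF45 and \\uFFFF\"}";
        Matcher m = ESCAPES.matcher(json);
        while (m.find()) {
            String run = m.group();
            StringBuilder decoded = new StringBuilder();
            for (int i = 0; i < run.length(); i += 6) { // each escape is 6 chars: \uXXXX
                decoded.append((char) Integer.parseInt(run.substring(i + 2, i + 6), 16));
            }
            System.out.println(run + " -> "
                + (OK.containsAll(decoded.toString()) ? "ok" : "NOT an assigned character"));
        }
    }
}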
Combining character example
I happened to run across a good example of productive use of combining marks, the Duden site (a great online dictionary for German). They use U+0323 ( ̣) COMBINING DOT BELOW to indicate the stress. Here is an example: ụnterbuttern http://www.duden.de/rechtschreibung/unterbuttern They aren't, however, consistent; you also see underlining for stress: e̲i̲nschränken. Interestingly, that is done not with the HTML underline, but with U+0332 ( ̲ ) COMBINING LOW LINE. Mark https://google.com/+MarkDavis
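The sequences are easy to produce programmatically; a tiny Java illustration (the class name is invented, and the vowel-length interpretation follows the correction in the next message):

public class DudenStress {
    public static void main(String[] args) {
        // Stressed short vowel: base letter + U+0323 COMBINING DOT BELOW.
        System.out.println("u\u0323nterbuttern");            // ụnterbuttern
        // Stressed long vowel/diphthong: each letter + U+0332 COMBINING LOW LINE.
        System.out.println("e\u0332i\u0332nschr\u00E4nken"); // e̲i̲nschränken
    }
}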
Re: Combining character example
Thanks for the corrections; I should have looked for a key to the conventions they use. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, Apr 16, 2015 at 11:32 AM, Jörg Knappen jknap...@web.de wrote: Hi Mark, the use of DOT BELOW and LINE BELOW is in fact consistent in German Duden. The difference in the diacritics is used to denote length of the stressed vowel, DOT BELOW denotes a short vowel and LINE BELOW denotes a long vowel. Diphthongs are always long and there is a single line under the whole diphthong. Digraphs (e.g. the ou in words borrowed from French) also have either a single line under the whole digraph or (this happens rarely) a single dot in the middle of the digraph. --Jörg Knappen *Sent:* Thursday, 16 April 2015 at 10:01 *From:* Mark Davis ☕️ m...@macchiato.com *To:* Unicode Public unicode@unicode.org, Unicode Book b...@unicode.org *Subject:* Combining character example I happened to run across a good example of productive use of combining marks, the Duden site (a great online dictionary for German). They use U+0323 ( ̣) COMBINING DOT BELOW to indicate the stress. Here is an example: ụnterbuttern http://www.duden.de/rechtschreibung/unterbuttern They aren't, however, consistent; you also see underlining for stress: e̲i̲nschränken. Interestingly, that is done not with the HTML underline, but with U+0332 ( ̲ ) COMBINING LOW LINE. Mark https://google.com/+MarkDavis
Re: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
It only provides a stand-in glyph if you don't otherwise have a font for that character on your system. That stand-in just indicates the type of character (e.g., script). No single font with current technology can handle all of Unicode. The most complete open font set I know of is the Noto family: https://www.google.com/get/noto/. I don't think it has a full set of symbols (others: correct me if I'm wrong). Symbola is pretty good for arbitrary symbols. There are many other resources on http://unicode.org/resources/fonts.html. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, Mar 26, 2015 at 8:53 AM, Michael McGlothlin mike.mcgloth...@gmail.com wrote: Similar but with a couple differences. Most important would be getting vendors to actually use the font. Also it should be appropriate to actually display the characters rather than being debugging information. Does this last resort font represent every character in some meaningful way? e.g. I've tried to use somewhat rare characters like runes before and it was a pretty big pain to find fonts that were free to distribute, weren't buggy, and displayed the correct symbol for that character. And some applications wouldn't display them correctly even after installing a font. (Visual Studio let me use runes as variable names and compiled fine but wouldn't actually display the rune symbols.) Sent from my iPad On Mar 25, 2015, at 5:18 PM, Shervin Afshar shervinafs...@gmail.com wrote: Just like the Unicode Last Resort Font[1]? [1]: http://www.unicode.org/policies/lastresortfont_eula.html ↪ Shervin On Wed, Mar 25, 2015 at 2:24 PM, Michael McGlothlin mike.mcgloth...@gmail.com wrote: I'd like to see a free/open default font that has a correct, simply styled, symbol for every Unicode character. Vendors should be pressured to use this font when other options aren't available. I get tired of seeing default symbols, incorrect symbols, and mystery white spaces that aren't really white space. It's pretty silly to have a code point without a default symbol I think. Thanks, Michael McGlothlin Sent from my iPhone On Mar 25, 2015, at 12:20 PM, Robert Wheelock rwhlk...@gmail.com wrote: Hello! When you’re typing, do you find yourself winding up being CONFUSED over what you type?!?! It’s a crucially SERIOUS matter—especially when typing on a computer! For instance: When you type in a HOLLOW HEART SUIT (U+02661), it may show up as an IDENTICAL TO SIGN (U+02261) or a GREEK CAPITAL LETTER XI (U+0039E)... it all DEPENDS on whatever FONT you’re using to type with! The default Microsoft Sans Serif font (within Microsoft Windows) has this ABOMINABLE habit of substituting this IDENTICAL TO SIGN (which should be at U+02261)—because Microsoft (regrettably) placed this math symbol where the HOLLOW HEART SUIT should be (at U+02661)! * ¡AGONISTES!* What Microsoft SHOULD DO *is* *THIS*: Please move the IDENTICAL TO SIGN from (U+02661—the location where the HOLLOW HEART SUIT goes) to its PROPER LOCATION at (U+02261)!! THAT would be MUCH better!! What other CHARACTER CALAMITIES have you come across?!?! Thank You!
Re: Android 5.1 ships with support for several minority scripts
Congrats! {phone} On Mar 14, 2015 03:09, Roozbeh Pournader rooz...@unicode.org wrote: Android 5.1 http://officialandroid.blogspot.com/2015/03/android-51-unwrapping-new-lollipop.html, released earlier this week, has added support for 25 minority scripts. The wide coverage can be reproduced by almost everybody for free, thanks to the Noto https://code.google.com/p/noto/ and HarfBuzz http://www.freedesktop.org/wiki/Software/HarfBuzz/ projects, both of which are open source. (Android itself is open source too.) By my count, these are the new scripts added in Android 5.1: Balinese, Batak, Buginese, Buhid, Cham, Coptic, Glagolitic, Hanunoo, Javanese, Kayah Li, Lepcha, Limbu, Meetei Mayek, Ol Chiki, Oriya, Rejang, Saurashtra, Sundanese, Syloti Nagri, Tagbanwa, Tai Le, Tai Tham, Tai Viet, Thaana, and Tifinagh. (Android 5.0, released last year, had already added the Georgian lari, complete Unicode 7.0 coverage for Latin, Greek, and Cyrillic, and seven new scripts: Braille, Canadian Aboriginal Syllabics, Cherokee, Gujarati, Gurmukhi, Sinhala, and Yi.) Note that different Android vendors and carriers may choose to ship more or fewer fonts, but Android One http://www.android.com/one/ phones and most Nexus http://www.google.com/nexus/ devices will support all the above scripts out of the box. None of this would have been possible without the efforts of Unicode volunteers who worked hard to encode the scripts in Unicode. Thanks to the efforts of Unicode, Noto, and HarfBuzz, thousands of communities around the world can now read and write their language on smartphones and tablets for the first time.
Re: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
We are being pretty conservative about what we add. There are approximately 1,200 emoji characters now (see tr51), and we're anticipating adding perhaps 50 per release. And we are encouraging a sticker approach for the longer term. On the other hand, I wouldn't be surprised if the 41 emoji characters that we are planning on for Unicode 8.0 end up having a higher frequency of use than the other 7K characters in the release. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Mon, Feb 9, 2015 at 9:36 PM, Michael Everson ever...@evertype.com wrote: I like symbols a lot. But I know that I and a number of people have been thinking that too much emphasis is being put on emoji. Michael Everson * http://www.evertype.com/
Re: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)
In what character encoding standard, or extension, does ROBOT FACE appear? Unicode has never been limited to what is in other character encoding standards or extensions, official or de facto. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Mon, Feb 9, 2015 at 9:16 PM, Doug Ewell d...@ewellic.org wrote: Shervin Afshar shervinafshar at gmail dot com wrote: There is no longer any requirement that the robot faces and burritos appear first in any sort of industry character set extension, with which Unicode is then obliged to maintain compatibility. Only if you don't consider existing usage and popular requests as requirements and precedents; for example Gmail had Robot Face for a long time. I said there was no longer a requirement *that the items appear first in an industry character set extension*, right? In what character encoding standard, or extension, does ROBOT FACE appear? "Gmail has it" is not a character encoding standard. Neither is "People want to see it." "Most popularly requested," as a criterion for adding a character, is absolutely new to Unicode. Earlier I wrote privately to a Unicode officer about whether PERSON TAKING SELFIE and GIRL TWERKING and PERSON DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got no reply. (What, you've forgotten the ice-bucket craze already? That's exactly why "most popular at the moment" wasn't supposed to be a criterion.) -- Doug Ewell | Thornton, CO, USA | http://ewellic.org
Re: About cultural/languages communities flags
On Tue, Feb 10, 2015 at 12:11 AM, Ken Whistler kenwhist...@att.net wrote: for the full context, and for the current 26x26 letter matrix which is the basis for the flag glyph implementations of regional indicator code pairs on smartphones. SC, SO, ST are already taken, but might I suggest putting in for registering AB for Alba? That one is currently unassigned. Yeah, yeah, what is the likelihood of BSI pushing for a Scots two-letter code?! But seriously, if folks are planning ahead for Scots independence or even some kind of greater autonomy, this is an issue that needs to be worked, anyway. In the meantime, let me reiterate that there is *no* formal relationship between TLD's and the regional indicator codes in Unicode (or the implementations built upon them). Well, yes, a bunch of registered TLD's do match the country codes, but there is no two-letter constraint on TLD's. This should already be apparent, as Scotland has registered .scot At this point there isn't even a limitation of TLD's to ASCII letters, so there is no way to map them to the limited set of regional indicator codes in the Unicode Standard. Not having a two-letter country code for Scotland that matches the four-letter TLD for Scotland might indeed be a problem for someone, but I don't see *this* as a problem that the Unicode Standard needs to solve. To add to that: there are already a fair number of ISO 2-letter codes for regions that are administered as part of another country, like Hong Kong. There are also codes for crown possessions like Guernsey. So having a code for Scotland (and Wales, and N. Ireland) would not really break precedent. But as Ken says, the best mechanism is for the UK to push for a code in ISO and the UN. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
Re: UAX 29 questions
I apologize in advance that I'm running low on time, and didn't go through all the messages on this thread carefully. So I may not be fully appreciating people's positions. I'm just making some quick points about 2 items that caught my eye. 1. There are certainly times where two rules in sequence may overlap, just for simplicity: X Y* × Z and Y × Z* W. The first rule could trigger on X Y Z W, even though the second would also trigger on it. This may or may not be sloppiness; sometimes it simply makes the second rule too convoluted to also exclude triggering on everything that could possibly trigger earlier. That being said, if there are simplifications in the rules that would make it clearer, I'd suggest submitting a proposal for that. The UTC is meeting next week, and could consider it either then or at subsequent meetings. Note: the HTML files in http://unicode.org/Public/UNIDATA/auxiliary/ have a number of sample cases (which are also used in the test files). Hovering over boundaries in those sample cases shows which rule is triggered, such as in http://unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakTest.html#samples We're always open to additional samples that are illustrative of how the rules work. As I thought about your message, it became clear to me that it would be useful to have a complete enough set of sample cases that each rule is triggered by at least one case, if you or anyone else is interested in helping to add those. 2. Also, the following 2 rules are not equivalent: a) Any × (Format | Extend) b) X (Extend | Format)* → X (b) implies (a), but not the reverse. The difference is on the right side of characters. Rule (b) affects every subsequent rule, and can be viewed as a shorthand. After it, we can just say: A B × C D And that has the effect of saying: A (Extend | Format)* B (Extend | Format)* × C (Extend | Format)* D See also http://unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules However, it may not be clear that (b) implies (a); that might be what you are getting at. If so, then we could add an explicit statement to that effect. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, Jan 29, 2015 at 7:52 PM, Karl Williamson pub...@khwilliamson.com wrote: On 01/25/2015 05:14 AM, Philippe Verdy wrote: This is not a contradiction. At the very least it is too sloppy for a standard. Once there is a match in the list of rules, later rules shouldn't have to be looked at. I'll submit a formal feedback form. But there is another issue as well. I do not see how the specified rules when applied to the sequence of code points: U+0041 U+200D U+0020 cause the ZWJ, an Extend, to not break with the A, an ALetter. Rule WB4 is "Ignore Format and Extend characters, except when they appear at the beginning of a region of text." Not clearly stated, but it appears to me that the ZWJ must be considered here to be the beginning of a region of text, as we are looking at the boundary between it and the A. No rule specifically mentions ALetter followed by an Extend, so by the default rule, WB14 "Otherwise, break everywhere (including around ideographs)", this should be a word break position. But that is absurd, as the Extend is supposed to extend what precedes it. If I add a rule "Don't break before Extend or Format": × (Extend | Format), my implementation passes all tests. I added this rule before WB4.
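The effect of the X (Extend | Format)* → X shorthand is easy to see with a segmenter. A quick sketch using ICU4J's word BreakIterator (assuming ICU is available; the sample string is arbitrary):

import com.ibm.icu.text.BreakIterator;
import java.util.Locale;

public class Wb4Demo {
    public static void main(String[] args) {
        // 'a' + U+0301 COMBINING ACUTE ACCENT (an Extend character) + 'b', then 'c'.
        // WB4 makes the Extend invisible to the later rules, so "a\u0301b" is one word.
        String text = "a\u0301b c";
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            System.out.println("«" + text.substring(start, end) + "»");
        }
        // Prints: «áb», « », «c»
    }
}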
combine the two rules and they are equivalent to these two alternate rules: WB56 can be read as these two: (WB56a) ALetter × (MidLetter | MidNumLet | Single_Quote) (ALetter | Hebrew_Letter) (WB56b) Hebrew_Letter × (MidLetter | MidNumLet | Single_Quote) (ALetter | Hebrew_Letter) Then add: (WB57) Hebrew_Letter × Single_Quote it just removes the condition of a letter following the quote in WB56b. So that WB56b and WB57 can be read as equivalent to these two: (WB56c) Hebrew_Letter × (MidLetter | MidNumLet) (ALetter | Hebrew_Letter) (WB57) Hebrew_Letter × Single_Quote But you cannot merge any of these two last rules in a single rule for WB56. 2015-01-25 7:26 GMT+01:00 Karl Williamson pub...@khwilliamson.com: I vaguely recall asking something like this before, but if so, I didn't save the answers, and a search of the archives didn't turn up anything. Some of the rules in UAX #29 don't make sense to me. For example, rule WB7a Hebrew_Letter × Single_Quote seems to say that a Hebrew_Letter followed by a Single_Quote shouldn't break. (And Rule WB4 says that actually there can be Extend and Format characters between the two and those should be ignored). But the earlier rule, WB6 (ALetter | Hebrew_Letter) × (MidLetter | MidNumLet | Single_Quote) (ALetter | Hebrew_Letter) seems to me to say (among other things) that a Hebrew Letter
Re: (R), (c) and ™
On Thu, Dec 18, 2014 at 11:31 AM, Andrea Giammarchi andrea.giammar...@gmail.com wrote: "standard variant sensitive" It is not clear what you mean by "standard variant sensitive". Can you elaborate? Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
Re: (R), (c) and ™
Note that emoji ≠ present in http://www.unicode.org/Public/UNIDATA/EmojiSources.txt It would probably be useful to read through http://www.unicode.org/reports/tr51/, which is where we are working on various aspects of emoji, in your case especially - http://www.unicode.org/reports/tr51/#Identification - http://www.unicode.org/reports/tr51/#Presentation_Style There are charts attached to the TR that can also be reviewed (and commented on), such as http://www.unicode.org/Public/emoji/1.0/text-style.html If you have feedback on the data (either supporting what is there, or recommending changes), you can submit your feedback via a link to Feedback (found at the top, and in the review notes for each of the sections). We haven't yet made firm recommendations on the variation selectors or the default emoji style, so what is there is a fairly raw draft (but we are making progress; see https://plus.google.com/+MarkDavis/posts/MLqEc79yN22). Personally, I think that if a character is in the recommended list for emoji, then: - if the default style is text, we must have variation selectors. - if the default style is emoji, then we should have variation selectors if it is in common use with a non-emoji presentation (typical for characters that have been in Unicode for a long time). Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, Dec 18, 2014 at 12:09 PM, Andrea Giammarchi andrea.giammar...@gmail.com wrote: Thanks Mark, I mean not listed anywhere here: http://unicode.org/Public/UNIDATA/StandardizedVariants.txt I'd expect to find the following there: 00A9 FE0E; text style; # COPYRIGHT SIGN 00A9 FE0F; emoji style; # COPYRIGHT SIGN for the simple reason that 00A9 is listed as emoji: http://www.unicode.org/Public/UNIDATA/EmojiSources.txt Apparently there's no place that says FE0F should affect 00A9, neither a place that states the opposite: 00A9 FE0E as text. Are my expectations wrong, or should these chars be handled any differently from other emoji? Thanks On Thu, Dec 18, 2014 at 11:03 AM, Mark Davis ☕️ m...@macchiato.com wrote: On Thu, Dec 18, 2014 at 11:31 AM, Andrea Giammarchi andrea.giammar...@gmail.com wrote: "standard variant sensitive" It is not clear what you mean by "standard variant sensitive". Can you elaborate? Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
Re: emoji are clearly the current meme fad
We just had a new blog posting; we've moved the media list out of tr51, and the list already had that item on it. See: http://www.unicode.org/press/emoji.html#media Separately, I keep a list of how the media refers to the Unicode consortium: my favorite is "shadowy emoji overlords". Bonus points to the first person who can find the one that refers to us as part of "a shameful plot to destroy the institution of marriage"... Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Tue, Dec 16, 2014 at 6:36 PM, Asmus Freytag asm...@ix.netcom.com wrote: Everybody wants in on the act: http://mashable.com/2014/12/12/bill-nye-evolution-emoji/ A./
Re: emoji are clearly the current meme fad
On Wed, Dec 17, 2014 at 9:03 PM, Murray Sargent murr...@exchange.microsoft.com wrote: http://www.theguardian.com/commentisfree/2014/nov/28/the-problem-with-emojis Bingo, Murray wins the prize! [image: Inline image 1] Not to open until Christmas...
Re: The rapid ... erosion of definition ability
On Mon Nov 17 2014 at 12:15:08 PM Andreas Stötzner a...@signographie.de wrote: On 17.11.2014 at 11:46, Leonardo Boiko wrote: "Sign is too general" In its generality it is just perfect. The sets of signs in question are most general, covering much more matters, objects and topics than the actual emoticons. They’re just signs and that’s it. The term 'emoji' is certainly a useful term for people to use, denoting a certain kind of symbol. Saying that one should never use it is like saying that one should never say "dog" or "cat", only the generic "animal"... The UCS defines the 1F600 set properly as Emoticons. At least, we should (in English) speak of Emoticons and not Emoji. Not really (and we don't really define them as emoticons; that's just the block name—and arguably it should have been different). Other “symbols” (another misnomer i.m.h.o., but that’s another story) Not, at least, in English. of this kind are termed “Miscellaneous Symbols and Pictographs”. This is not bad but imprecise as well, since many of these signs are not pictographs but ideographs. We warn people in multiple places that the names of blocks are *not* reliable guides to the kinds of characters in the block. Yeah what the heck ;) We have a long tradition of naming these things rather lousily (“Dingbats”). I am a traditionalist as a matter of fact, but if precise terming is tricky I find it better to generalize than to blur. I generally agree about the utility of having generic terms in a language. Listening to Swiss newscasts, I find it bizarre to hear pretty clumsy phrasing that is the equivalent of the following (because there is a different form for male and female of many nouns): — The politicians(m) and politicians(f) met with the directors(m) and directors(f), writers(m) and writers(f), and actors(m) and actresses. We suffer from it much less in English, mostly with he and she, although clearly the use of they as a gender-neutral singular is on the upswing (although it's been around for centuries). However, what is most useful is when there are generic terms, *plus* specific ones. ___ Andreas Stötzner Gestaltung Signographie Fontentwicklung Haus des Buches Gerichtsweg 28, Raum 434 04103 Leipzig 0176-86823396 http://stoetzner-gestaltung.prosite.com
Re: The rapid ... erosion of definition ability
I agree (except for the derivation of emoji). On Mon Nov 17 2014 at 11:46:58 AM Leonardo Boiko leobo...@namakajiri.net wrote: Sign is too general. The word has no less than 12 meanings, and can refer e.g. to many Unicode characters that are not emojis (the sharp sign, the less-than sign).[1] It's useful to have a specialized word referring specifically to the new pictograms used to color electronic messages with emotional inflection. Borrowing is a perfectly adequate and natural strategy to get such a word into a language – as indeed English did with the word sign, from Old French *signe*, from Latin *signum*; and as Japanese did with the English word *emotion*, from which the *emo-* in *emoji*, and with Chinese, from which *-ji* 'written character'. If borrowing words when they're useful is ridiculous, then all languages are ridiculous, and when everything is ridiculous nothing is. [1] http://en.wiktionary.org/wiki/sign 2014-11-17 8:09 GMT-02:00 Andreas Stötzner a...@signographie.de: On 17.11.2014 at 08:35, Mark Davis ☕️ wrote: IT’S EASY TO DISMISS EMOJI. They are, at first glance, ridiculous The only ridiculous thing is to name them “Emoji” outside Japan. They’re just signs and that’s it. Regards, Andreas Stötzner. ___ Andreas Stötzner Gestaltung Signographie Fontentwicklung Haus des Buches Gerichtsweg 28, Raum 434 04103 Leipzig 0176-86823396 http://stoetzner-gestaltung.prosite.com
The rapid evolution of a wordless tongue
http://nymag.com/daily/intelligencer/2014/11/emojis-rapid-evolution.html A more extended article from NY Magazine about the growing usage of emoji, and the ways in which that usage is developing. Has a quote from Peter Constable and (indirect) reference to +Steven R. Loomis. “IT’S EASY TO DISMISS EMOJI. They are, at first glance, ridiculous. They are a small invasive cartoon army of faces and vehicles and flags and food and symbols trying to topple the millennia-long reign of words. Emoji are intended to illustrate, or in some cases replace altogether, the words we send each other digitally, whether in a text message, email, or tweet. Taken together, emoji look like the electronic equivalent of those puffy stickers tweens used to ornament their Trapper Keepers. And yet...”
Re: Emoji skin tone modifiers on the website of a leading German daily newspaper
As far as I can tell it is garnering interest all over: several German publications, including Spiegel; French and Italian regional papers; Indonesian and Vietnamese sites. http://www.spiegel.de/netzwelt/web/unicode-consortium-emojis-demnaechst-fuer-alle-hautfarben-a-1001125.html http://m.baohay.vn/chuyen-de/cong-nghe/961227/Bieu-tuong-Emoji-se-co-mau-da-thay-doi.html {phone} On Nov 8, 2014 12:04 AM, Karl Pentzlin karl-pentz...@acssoft.de wrote: FYI: On 2014-11-05, a report on Emoji skin tone modifiers was published on the website of the Frankfurter Allgemeine, a leading German daily newspaper: http://www.faz.net/aktuell/gesellschaft/emoticons-smileys-bald-in-fuenf-hautfarben-13249783.html - Karl Pentzlin
Re: Open Source Emoji for the Web
One can definitely script it; if you hadn't had compat issues it would be convenient to have the same convention. On Thu Nov 06 2014 at 11:30:09 PM Andrea Giammarchi andrea.giammar...@gmail.com wrote: Thanks Mark, I will consider this change with CDN chaps too since that would invalidate already a lot of cached content at the time it'll ship :-/ We should have paid more attention, on the other side if you need assets locally instead of via CDN a script capable of renaming assets from current form to your suggested one seems straight forward to me. Would that (sort of) work? Thanks On Fri, Nov 7, 2014 at 12:18 AM, Mark Davis ☕️ m...@macchiato.com wrote: Very nice. I'd have one suggestion. People appear to be converging on similar file names for the emoji. - Lowercase hex numbers, - at least 4 digits, - otherwise no leading zeros, - multiple code points separated by _, - with optional prefix/suffix. Like dcm_0030_20e3.png. I'd suggest using that convention. Not a big thing, but makes it more consistent in tooling. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, Nov 6, 2014 at 3:27 PM, Andrea Giammarchi andrea.giammar...@gmail.com wrote: I'd like to thank those that helped me a while ago figuring out variants and emoji behavior. Today we are open sourcing a relatively small JS library and 800+ CDN based assets able to bring unified emoji in every WebView capable device and browser. We are also planning to implement the recently introduced diversity for the Unicode 8 draft as soon as we'll figure out a good approach for it ( and btw, the default fallback is great! ) This effort and collaboration is between Twitter [1], MaxCDN [2], and Wordpress [3]. Any comment or suggestion will be more than welcome and appreciated. Thanks again and Best Regards [1] https://blog.twitter.com/2014/open-sourcing-twitter-emoji-for-everyone [2] https://www.maxcdn.com/blog/emojis-ftw/ [3] http://en.blog.wordpress.com/2014/11/06/emoji-everywhere/
keynote
As an experiment, we recorded the keynote at the Unicode Conference. I posted the recordings at http://macchiati.blogspot.com/2014/11/unicode-emoji.html Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
Re: Open Source Emoji for the Web
Very nice. I'd have one suggestion. People appear to be converging on similar file names for the emoji. - Lowercase hex numbers, - at least 4 digits, - otherwise no leading zeros, - multiple code points separated by _, - with optional prefix/suffix. Like dcm_0030_20e3.png. I'd suggest using that convention. Not a big thing, but makes it more consistent in tooling. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, Nov 6, 2014 at 3:27 PM, Andrea Giammarchi andrea.giammar...@gmail.com wrote: I'd like to thank those that helped me a while ago figuring out variants and emoji behavior. Today we are open sourcing a relatively small JS library and 800+ CDN based assets able to bring unified emoji in every WebView capable device and browser. We are also planning to implement the recently introduced diversity for the Unicode 8 draft as soon as we'll figure out a good approach for it ( and btw, the default fallback is great! ) This effort and collaboration is between Twitter [1], MaxCDN [2], and Wordpress [3]. Any comment or suggestion will be more than welcome and appreciated. Thanks again and Best Regards [1] https://blog.twitter.com/2014/open-sourcing-twitter-emoji-for-everyone [2] https://www.maxcdn.com/blog/emojis-ftw/ [3] http://en.blog.wordpress.com/2014/11/06/emoji-everywhere/
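A sketch of that naming convention in code (the class and method names are invented for illustration); note that %04x already gives lowercase hex with a minimum of 4 digits and no other leading zeros:

public class EmojiFileNames {
    // Lowercase hex, at least 4 digits, no other leading zeros,
    // code points joined by '_', with an optional prefix/suffix.
    static String fileName(String prefix, String emoji, String suffix) {
        StringBuilder sb = new StringBuilder(prefix);
        boolean first = true;
        for (int i = 0; i < emoji.length(); ) {
            int cp = emoji.codePointAt(i);
            if (!first) sb.append('_');
            sb.append(String.format("%04x", cp));
            first = false;
            i += Character.charCount(cp);
        }
        return sb.append(suffix).toString();
    }

    public static void main(String[] args) {
        // DIGIT ZERO + COMBINING ENCLOSING KEYCAP, as in Mark's example.
        System.out.println(fileName("dcm_", "0\u20E3", ".png")); // dcm_0030_20e3.png
    }
}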
Re: Question about a Normalization test
On Thu, Oct 23, 2014 at 6:54 PM, Aaron Cannon cann...@fireantproductions.com wrote: 0061 05AE 0305 0300 0315 0062 http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cu0061+%5Cu05AE+%5Cu0305+%5Cu0300+%5Cu0315+%5Cu0062g=ccc 0305 and 0300 have the same ccc, so the first one blocks the second. http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G49576 The older spec is shorter, although not as precise: http://www.unicode.org/reports/tr15/tr15-29.html#Specification Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
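The blocking behavior is easy to verify with ICU4J's Normalizer2 (a minimal sketch, assuming ICU is on the classpath):

import com.ibm.icu.text.Normalizer2;

public class BlockedComposition {
    public static void main(String[] args) {
        // a + U+05AE (ccc 228) + U+0305 COMBINING OVERLINE (ccc 230)
        //   + U+0300 COMBINING GRAVE (ccc 230) + U+0315 (ccc 232) + b
        String s = "\u0061\u05AE\u0305\u0300\u0315\u0062";
        String nfc = Normalizer2.getNFCInstance().normalize(s);
        // U+0300 would normally compose with 'a' to give U+00E0,
        // but the preceding U+0305 has the same ccc (230) and blocks it,
        // so NFC leaves the sequence unchanged.
        System.out.println(s.equals(nfc)); // true
    }
}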
fonts for U7.0 scripts
I'm looking for freely downloadable TTF fonts for any of the following. I'd appreciate links to sites for any of these:
1. Bassa_Vah
2. Duployan
3. Grantha
4. Khojki
5. Khudawadi
6. Mahajani
7. Mende_Kikakui
8. Modi
9. Mro
10. Nabataean
11. Old_Permic
12. Palmyrene
13. Pau_Cin_Hau
14. Tirhuta
15. Warang_Citi
Coverage doesn't need to be complete, and the font doesn't need to support shaping (these are just for charts / illustrations). Mark https://google.com/+MarkDavis
Re: What happened to...?
I agree that we should minute at least some reason for declining. It need only be a sentence or two. (BTW I wasn't at that discussion.) {phone} On Sep 20, 2014 3:17 AM, Asmus Freytag asm...@ix.netcom.com wrote: On 9/19/2014 5:38 PM, Whistler, Ken wrote: Michael, “Declines to take action” is pretty thin. A proposal which is declined by the UTC doesn't automatically create an obligation to write an extended dissertation explaining the rationale and putting that rationale on record. It might be one thing if there were a lot of controversy involved, and one group of participants asked for a rationale to be recorded, despite not having a consensus to move on something -- but this one wasn't even close. Nobody in the committee felt encoding was justified in this case. And not every mark on paper -- not even every mark *printed* in typeset material on paper -- is automatically an obvious candidate for encoding with a simple, plain text character representation. True, but a rationale (note that's not necessarily a dissertation) never hurts. “Declines to take action” may look like it is equivalent to “Nobody in the committee felt encoding was justified in this case”, but it really isn't. The former allows for all sorts of non-substantive reasons, but the latter is pretty clear: the submitter failed to make the case. What you are looking for is something equivalent to summary dismissal of a legal action, but even there this usually gets some rationale, or it has the benefit of a standardized legal principle (don't know for a fact, but sounds plausible). A./ --Ken
Re: FYI: Ruble sign in Windows
Cool, congratulations! Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, Aug 14, 2014 at 3:52 PM, Peter Constable peter...@microsoft.com wrote: For those interested, there is an update for Windows available now to add font, keyboard and locale data support for the Ruble sign that was added in Unicode 7.0. For details, see here: http://support.microsoft.com/kb/2970228 Peter
Re: meaningful and meaningless FE0E
These variation selector characters only apply to specific characters, those listed in http://unicode.org/Public/UNIDATA/StandardizedVariants.html There is a machine-readable version at http://unicode.org/Public/UNIDATA/StandardizedVariants.txt Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Sun, Jun 29, 2014 at 8:47 AM, Andrea Giammarchi andrea.giammar...@gmail.com wrote: ok, here is the simplified version of my question: would U+1F4A9 followed by U+FE0E be represented differently from what U+1F4A9 is normally? is such a sequence even a real concern or intent specified anywhere? (no, can't find it, asking just for confirmation) Thanks a lot for any outcome! Best Regards On Sat, Jun 28, 2014 at 10:33 AM, Andrea Giammarchi andrea.giammar...@gmail.com wrote: Dear all, this is my first email in this channel so apologies in advance if already discussed. I am trying to understand the expected behavior when there is an unexpected VS15 after emoji that have not been defined, according to this file http://www.unicode.org/Public/UNIDATA/NamesList.txt, as VS15 sensitive. My take on FE0E is that all emoji that are sensitive to this variant have an emojified counterpart that should be used when followed by FE0F, and vice-versa a textual part when followed by FE0E, but all other emoji should not consider such a variant at all since there's no textual counterpart to represent, let's say, a U+1F4A9 pile-of-poo \ud83d\udca9\ufe0e Can anyone please confirm my expectations are correct, so that the above sequence in both Java or JavaScript will show the POO emoji regardless, followed by the FE0E variant that will be simply ignored, and actually no device/OS/render/viewer/browser would ever create such a sequence, so it's actually a non-problem, this one I am trying to solve? Thanks in advance and Best Regards
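One way to make that check mechanical is to scan the machine-readable file Mark points to; a rough sketch (it assumes StandardizedVariants.txt has been downloaded locally, and the class name is invented):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class StandardizedVariantCheck {
    public static void main(String[] args) throws IOException {
        Set<String> sequences = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get("StandardizedVariants.txt"))) {
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash); // strip comments
            String[] fields = line.split(";");
            if (fields.length < 2) continue;               // skip blank/header lines
            sequences.add(fields[0].trim());               // e.g. "1F170 FE0E"
        }
        // PILE OF POO has no standardized variation sequence,
        // so a following FE0E is simply ignored:
        System.out.println(sequences.contains("1F4A9 FE0E")); // false
    }
}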
Re: Swift
I haven't done any analysis, but on first glance it looks like it is based on http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn s...@maya.com wrote: Has anyone figured out whether character sequences that are non-canonical (de)compositions but could be recomposed to the same result are the same identifier or not? That is: are identifiers merely sequences of characters or intended to be comparable as “Unicode strings” (under some sort of compatibility rule)? On Jun 5, 2014, at 11:27 AM, Martin v. Löwis mar...@v.loewis.de wrote: On 04.06.14 11:28, Andre Schappo wrote: The restrictions seem a little like IDNA2008. Anyone have links to info giving a detailed explanation/tabulation of allowed and non allowed Unicode chars for Swift Variable and Constant names? The language reference is at https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html For reference, the definition of identifier-character is (read each line as an alternative):
identifier-character → Digit 0 through 9
identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or U+FE20–U+FE2F
identifier-character → identifier-head
where identifier-head is:
identifier-head → Upper- or lowercase letter A through Z
identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or U+00B7–U+00BA
identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or U+00F8–U+00FF
identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or U+180F–U+1DBF
identifier-head → U+1E00–U+1FFF
identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054, or U+2060–U+206F
identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or U+2776–U+2793
identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF
identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or U+3040–U+D7FF
identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or U+FE30–U+FE44
identifier-head → U+FE47–U+FFFD
identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or U+40000–U+4FFFD
identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or U+80000–U+8FFFD
identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or U+C0000–U+CFFFD
identifier-head → U+D0000–U+DFFFD or U+E0000–U+EFFFD
As the construction principle for this list, they say: Identifiers begin with an upper case or lower case letter A through Z, an underscore (_), a noncombining alphanumeric Unicode character in the Basic Multilingual Plane, or a character outside the Basic Multilingual Plane that isn’t in a Private Use Area. After the first character, digits and combining Unicode characters are also allowed. Regards, Martin
Re: Swift
Apparently you can use emoji in the identifiers. ( http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/ ) Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Wed, Jun 4, 2014 at 11:28 AM, Andre Schappo a.scha...@lboro.ac.uk wrote: Swift is Apple's new programming language. In Swift, variable and constant names can be constructed from Unicode characters. Here are a couple of examples from Apple's doc http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html let π = 3.14159 let 你好 = "你好世界" I think this is a huge step forward for i18n and Unicode. There are some restrictions on which Unicode chars can be used. From Apple's doc: Constant and variable names cannot contain mathematical symbols, arrows, private-use (or invalid) Unicode code points, or line- and box-drawing characters. Nor can they begin with a number, although numbers may be included elsewhere within the name. The restrictions seem a little like IDNA2008. Anyone have links to info giving a detailed explanation/tabulation of allowed and non allowed Unicode chars for Swift Variable and Constant names? André Schappo
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 10:32 PM, David Starner prosfil...@gmail.com wrote: Why? It seems you're changing the rules ... This isn't "are changing", it is "has changed". The Corrigendum was issued at the start of 2013, about 16 months ago; applicable to all relevant earlier versions. It was the result of fairly extensive debate inside the UTC; there hasn't been a single issue on this thread that wasn't considered during the discussions there. And as far back as 2001, the UTC made it clear that noncharacters *are* scalar values, and are to be converted by UTF converters. E.g., see http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by chance, one day before 9/11). probably trigger serious bugs in some lamebrained utility. There were already plenty of programs that passed the noncharacters through; very few would filter them (some would delete them, which is horrible for security). Thinking that a utility would never encounter them in input text was a pipe-dream. If a utility or library is so fragile that it *breaks* on input of any valid UTF sequence, then it *is* a lamebrained utility. A good unit test for any production chain would be to check that there is no crash on any input scalar value (and for that matter, any ill-formed UTF text).
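That last suggestion is easy to automate. A minimal sketch of such a test in Java, where process stands in for whatever converter or utility is being exercised (here, as an arbitrary placeholder, JDK NFC normalization):

public class ScalarValueSmokeTest {
    // Placeholder for the function under test.
    static String process(String s) {
        return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue; // surrogates are not scalar values
            // Must not crash on any scalar value, noncharacters included (e.g. U+FFFF).
            process(new String(Character.toChars(cp)));
        }
        System.out.println("no crashes");
    }
}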
Re: Corrigendum #9
On Tue, Jun 3, 2014 at 9:41 AM, David Starner prosfil...@gmail.com wrote: "Thinking that a utility would never mangle them if encountered in input text was a pipe-dream." I didn't say "not mangle", I said "break", as in crash. I don't think this thread is going anywhere productive, so I'm signing off from it. ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Unicode Regular Expressions, Surrogate Points and UTF-8
"\uD808\uDF45 specifies a sequence of two codepoints." That is simply incorrect. In Java (and similar environments), \u means a char (a UTF-16 code unit), not a code point. Here is the difference. If you are not used to Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x with the replacement y in string. Backslashes in literals need escaping, so \x needs to be written in literals as \\x.
String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45", "«.»"};
String target = "one: «\uD808\uDF45»\t\t"
    + "two: «\uD808\uDF45\uD808\uDF45»\t\t"
    + "lead: «\uD808»\t\t"
    + "trail: «\uDF45»\t\t"
    + "one+: «\uD808\uDF45\uD808»";
System.out.println("pattern" + "\t→\t" + target + "\n");
for (String test : tests) {
    System.out.println(test + "\t→\t" + target.replaceAll(test, "§︎"));
}
*Output:*
pattern → one: «⍅» two: «⍅⍅» lead: «?» trail: «?» one+: «⍅?»
\x{12345} → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
\uD808\uDF45 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
⍅ → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
«.» → one: §︎ two: «⍅⍅» lead: §︎ trail: §︎ one+: «⍅?»
The target has various combinations of code units, to see what happens. Notice that Java treats a pair of lead+trail as a single code point for matching (eg .), but also an isolated surrogate char as a single code point (last line of output). Note that Java's regex in addition allows \x{hex} for specifying a code point explicitly. It also has the syntax \\u (in a literal the \ needs escaping) to specify a code unit; that is slightly different from the Java preprocessing. Thus the first two below are equivalent, and replace { by x. The last two are also equivalent—and fail—because a single { is a broken regex pattern.
System.out.println("{".replaceAll("\\u007B", "x"));
System.out.println("{".replaceAll("\\x{7B}", "x"));
System.out.println("{".replaceAll("\u007B", "x"));
System.out.println("{".replaceAll("{", "x"));
Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: On Sun, 1 Jun 2014 08:58:26 -0700 Markus Scherer markus@gmail.com wrote: You misunderstand. In Java, \uD808\uDF45 is the only way to escape a supplementary code point, but as long as you have a surrogate pair, it is treated as a code point in APIs that support them. Wasn't it obvious that in the following paragraph \uD808\uDF45 was a pattern? Bear in mind that a pattern \uD808 shall not match anything in a well-formed Unicode string. \uD808\uDF45 specifies a sequence of two codepoints. This sequence can occur in an ill-formed UTF-32 Unicode string and before Unicode 5.2 could readily be taken to occur in an ill-formed UTF-8 Unicode string. RL1.7 declares that for a regular expression engine, the codepoint sequence U+D808, U+DF45 cannot occur in a UTF-16 Unicode string; instead, the code unit sequence D808 DF45 is the codepoint sequence U+12345 CUNEIFORM SIGN URU TIMES KI. (It might have been clearer to you if I'd said '8-bit' and '16-bit' instead of UTF-8 and UTF-16. It does make me wonder what you'd call a 16-bit encoding of arbitrary *codepoint* sequences.) Richard. ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Corrigendum #9
The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of apps, where it is perfectly reasonable to interchange sentinel values (for example). I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.) Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele shawn.ste...@microsoft.com wrote: I also think that the verbiage swung too far the other way. Sure, I might need to save or transmit a file to talk to myself later, but apps should be strongly discouraged from using these for interchange with other apps. Interchange bugs are why nearly any news web site ends up with at least a few articles with mangled apostrophes or whatever (because of encoding differences). Should authors’ tools or feeds or databases or whatever start emitting non-characters from internal use, then we’re going to have ugly leaks into text “everywhere”. So I’d prefer to see text that better permitted interchange with other components of an application’s internal system or partner system, yet discouraged use for interchange with “foreign” apps. -Shawn ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Corrigendum #9
On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele shawn.ste...@microsoft.com wrote: The “problem” is now that previously these characters were illegal The problem was that we were inconsistent in standard and related material about just what the status was for these things. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Corrigendum #9
I disagree with that characterization, of course. The recommendation for libraries and low-level tools to pass them through rather than screw with them makes them usable. The recommendation to check for noncharacters from unknown sources and fix them was good advice then, and is good advice now. Any app where input of noncharacters causes security problems or crashes is and was not a very good app. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Mon, Jun 2, 2014 at 6:37 PM, Asmus Freytag asm...@ix.netcom.com wrote: On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote: On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele shawn.ste...@microsoft.com wrote: The “problem” is now that previously these characters were illegal The problem was that we were inconsistent in standard and related material about just what the status was for these things. And threw the baby out to fix it. A./ ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
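A sketch (Java, not from the thread) of the check-and-fix step recommended above; the noncharacter test covers U+FDD0..U+FDEF plus the last two code points of every plane:

public class NoncharCheck {
    static boolean isNoncharacter(int cp) {
        // U+FDD0..U+FDEF, and any code point whose low 16 bits are FFFE or FFFF
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || ((cp & 0xFFFE) == 0xFFFE);
    }

    public static void main(String[] args) {
        String input = "ok\uFDD0ok";
        input.codePoints()
             .filter(NoncharCheck::isNoncharacter)
             .forEach(cp -> System.out.printf("noncharacter U+%04X in input%n", cp));
    }
}

Whether a caller then replaces such code points (say, with U+FFFD) or just flags them is a policy choice for the unknown-source boundary, not for low-level converters.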
Re: Unicode Regular Expressions, Surrogate Points and UTF-8
I think you have a point here. We should probably change to: To meet this requirement, an implementation shall supply a mechanism for specifying any Unicode scalar value (from U+0000 to U+D7FF and U+E000 to U+10FFFF), using the hexadecimal code point representation. and then in the notes say that the same notation can be used for codepoints that are not scalar values, for implementations that handle them in Unicode strings. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Fri, May 30, 2014 at 8:45 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: Is there any good reason for UTS#18 'Unicode Regular Expressions' to express its requirements in terms of codepoints rather than scalar values? I was initially worried by RL1.1 requiring that one be able to specify surrogate codepoints in a pattern. It would not be compliant for an application to reject such patterns as syntactically or semantically incorrect! RL1.1 seemed to prohibit compliant regular expression engines that only handled well-formed UTF-8 strings. Furthermore, consider attempting to handle CESU-8 text as a sequence of UTF-8 code units. The code unit sequence for U+10000 will, corresponding to the UTF-16 code unit sequence D800 DC00, be ED A0 80 ED B0 80. If one follows the lead of the 'best practice' for processing ill-formed UTF-8 code unit sequences given in TUS Section 5.22, this will be interpreted as *four* ill-formed sequences, ED A0, 80, ED B0, and 80. I am not aware of any recommendation as to how to interpret these sequences as codepoints. While being able to specify a search for surrogate codepoint U+D800 might be useful when dealing with ill-formed UTF-16 Unicode sequences, UTS#18 Section 1.7, which discusses requirement RL1.7, states that there is no requirement for a one-codepoint pattern such as \u{D800} to match a UTF-16 Unicode string consisting just of one code unit with the value 0xD800. The convenient, possibly intended, consequence of this is that the RL1.1 requirement to allow patterns to specify surrogate codepoints can be satisfied by simply treating them as unmatchable; for example, such a 1-character RE could be treated as the empty Unicode set [\p{gc=Lo}&\p{gc=Mn}]. Now, I suppose one might want to specify a match for ill-formed (in context) UTF-8 code unit subsequences such as E0 80 (not a valid initial subsequence) and E0 A5 (lacking a trailing byte), but as matching is not required, I don't see the point in UTS#18 being changed to ask for an appropriate syntax to be added. Richard. ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
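The proposed wording corresponds to a simple scalar-value test; a minimal sketch in Java (names are illustrative):

public class ScalarValues {
    static boolean isScalarValue(int cp) {
        // U+0000..U+D7FF and U+E000..U+10FFFF, per the wording above
        return (cp >= 0x0000 && cp <= 0xD7FF) || (cp >= 0xE000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isScalarValue(0x41));     // true: 'A'
        System.out.println(isScalarValue(0xD800));   // false: surrogate code point
        System.out.println(isScalarValue(0x10FFFF)); // true: last scalar value
    }
}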
Re: Long-Encoded Restricted Characters in High Frequency Modern Use
Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Fri, May 30, 2014 at 12:39 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: I am a little confused by the call for a review of UTS #39, Unicode Security Mechanisms (PRI #273). Are we being requested to report long-encoded 'restricted' characters in high frequency modern use? 'Restricted' refers to the classification in xidmodifications.txt. First, restricted characters are meant not for everyday use, but specifically just for the purpose of programming identifiers and similar sorts of identifiers. Moreover, it sets up a framework, but the conformance requirements are only that any modification is declared. http://www.unicode.org/reports/tr39/proposed.html#C1 You may know this all, but just to be sure. One linked pair of long-encoded restricted characters in high frequency use is U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM, which occurs in the extremely common Thai and Lao words for 'water' or 'liquid in general' น้ำ ນ້ຳ, whose NFKC decompositions are the nonsensical forms น้ํา ນ້ໍາ, but may be faked by the linguistically incorrect นํ้า ນໍ້າ. In Thai the encodings are U+0E19 THAI CHARACTER NO NU, U+0E49 THAI CHARACTER MAI THO, U+0E33 THAI CHARACTER SARA AM; U+0E19, U+0E49, U+0E4D THAI CHARACTER NIKHAHIT, U+0E32 THAI CHARACTER SARA AA; and U+0E19, U+0E49, U+0E4D, U+0E49, U+0E32. The structure of the data is based on the use of NFKC characters in identifiers. So SARA AM and the Lao equivalent are both not NFKC characters, are categorized as such, and would need to be represented by their NFKC forms. The process is in http://www.unicode.org/reports/tr39/proposed.html#IDMOD_Data_Collection You can see the categorization (for 6.3) for a whole script with a link like: http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=thai} (It only works for 6.3 right now, but these items haven't changed recently.) Now, U+0E4D THAI CHARACTER NIKHAHIT is classified as 'allowed; recommended', although its main use is in writing Pali, which would suggest that it should be 'restricted; historic' or 'restricted; limited-use'. For that, it would be best to submit via http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a feedback form at http://www.unicode.org/reporting.html, just to be sure. The situation is not so clear for Lao - U+0ECD LAO NIGGAHITA is a fairly common vowel in the Lao language. Based on your information, the following appear (at least to me) to be caused by typos in the xidmodifications source files; they are all marked as 'technical'. http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=khmer} Again, best to submit this like above (via http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a feedback form at http://www.unicode.org/reporting.html). To me, a truly bizarre set of 'restricted' characters is U+17CB KHMER SIGN BANTOC to U+17D0 KHMER SIGN SAMYOK SANNYA, which are categorised as 'restricted; technical'. They are all in use in the Khmer language. U+17CB KHMER SIGN BANTOC is required for the main methods of writing the Khmer vowels /a/ and /ɑ/. U+17CC KHMER SIGN ROBAT is a repha, but I would be surprised to learn that it has recently become little-used. It is, however, readily confused with U+17CD KHMER SIGN TOANDAKHIAT, a 'pure killer' whose main modern use is to show that a consonant is silent, rather like the Thai letter U+0E4C THAI CHARACTER THANTHAKHAT. (The names are the same.) 
The confusion arises because Sanskrit -rCa was pronounced /-r/ in Khmer, and final /r/ recently became silent in Khmer, so the effect of the Sanskrit /r/ is now to silence the final consonant. While U+17CE KHMER SIGN KAKABAT and U+17CF KHMER SIGN AHSDA may not be common, they are still in modern use. Although U+17D0 KHMER SIGN SAMYOK SANNYA may have declined in frequency, it has not dropped out of use and is still a common enough way of writing the vowel /a/. Richard. ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
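The SARA AM behavior discussed above can be verified directly; a minimal sketch, assuming ICU4J is on the classpath:

import com.ibm.icu.text.Normalizer2;

public class SaraAmNfkc {
    public static void main(String[] args) {
        Normalizer2 nfkc = Normalizer2.getNFKCInstance();
        String water = "\u0E19\u0E49\u0E33"; // น้ำ : NO NU, MAI THO, SARA AM
        String normalized = nfkc.normalize(water);
        for (int i = 0; i < normalized.length(); i++) {
            System.out.printf("U+%04X ", (int) normalized.charAt(i));
        }
        // Per the discussion above, SARA AM decomposes under NFKC to
        // NIKHAHIT + SARA AA, giving U+0E19 U+0E49 U+0E4D U+0E32.
    }
}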
Re: Corrigendum #9
A few quick items. (I admit to only skimming your response, Philippe; there is only so much time in the day.) Any discussion of changing non-characters is really pointless. See http://www.unicode.org/policies/property_value_stability_table.html As to breaking up the block, that is not forbidden: but one would have to give pretty compelling arguments that the benefits would outweigh any likely problems, especially since we already don't recommend the use of the block property in regexes. "And regular expressions trying to use character properties have many more caveats to handle (the most serious being with canonical equivalences and discontinuous matches or partial matches)." The UTC, after quite a bit of work, concluded that it was not feasible with today's regex engines to handle normalization automatically, instead recommending the approach in http://www.unicode.org/reports/tr18/#Canonical_Equivalents "Regexps are still a very experimental proposal, they are still very difficult to make interoperable except in a small set of tested cases" I have no idea where this is coming from. Regexes using Unicode properties are in widespread and successful use. It is not that hard to make them interoperable (as long as both implementations are using the same version of Unicode). Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Sat, May 31, 2014 at 9:36 PM, Philippe Verdy verd...@wanadoo.fr wrote: Maybe; but there's real doubt that a regular expression that would need this property would be severely broken if that property was corrected. There are many other properties that are more useful (and much more used) whose associated set of codepoints changes regularly across versions. I don't see any specific interest in maintaining non-characters in that block, as it effectively reduces the reusability of this property. And in fact it would be highly preferable to no longer state that these non-characters in Arabic Presentation Forms be treated like C1 controls or PUA (because they will never be reassigned to something more useful). Making them PUA would not radically change the fact that these characters are not recommended, but we would no longer bother about checking if they are valid or not. They remain there only as a legacy with old outdated versions of Unicode, for a mysterious need that I've not clearly identified. Let's assume we change them into PUA; some applications will start accepting them when some others won't. Not a problem given that they are already not interoperable. And regular expressions trying to use character properties have many more caveats to handle (the most serious being with canonical equivalences and discontinuous matches or partial matches; when searches are only focusing on exact sets of code points instead of sets of canonically equivalent texts; the other complication coming with the effect of collation and its variable strength matching more or less parts of text spanning ignorable collation elements, i.e., possibly also discontinuous runs of ignorable codepoints, if we want to get consistent results independent of the normalization form). 
More complicated is how to handle partial matches, such as a combining character within a precomposed character which is canonically equivalent to a string where this combining character appears. And even more tricky is how to handle substitution with regexps, for example when performing a search at primary collation level ignoring lettercase, but when we want to replace base letters but preserve case in the substituted string: this requires specific lookup of characters using properties **not** specified in the UCD but in the collation tailoring data, and then how to ensure that the result of the substitution in the plain-text source will remain a valid text not creating new unexpected canonical equivalences, and that it will also not break basic orthographic properties such as syllabic structures in a specific pair of language+script, and without also producing unexpected collation equivalents at the same collation strength (causing later unexpected never-ending loops of substitutions, for example in large websites with bots operating text corrections). Regexps are still a very experimental proposal; they are still very difficult to make interoperable except in a small set of tested cases, and for this reason I really doubt that the Block character property is very productive for now with regexps (and notably not with this compatibility block, whose characters will remain used in isolation, independently of their context, if they are still used in rare cases). I see little value in keeping this old complication in this block, but just more interoperability problems for implementations. So these noncharacters should be treated mostly like PUA, except that they have a few more properties: direction=RTL, script=Arabic, and starters working in isolation for the Arabic
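For reference, the approach recommended in #18's Canonical_Equivalents section is to normalize both sides before matching. A minimal sketch in Java, as an illustration under that one assumption rather than a full treatment of the collation issues raised above; the JDK CANON_EQ flag is shown only for contrast:

import java.text.Normalizer;
import java.util.regex.Pattern;

public class CanonEqMatch {
    public static void main(String[] args) {
        String composed   = "\u00E9";  // é, precomposed
        String decomposed = "e\u0301"; // e + COMBINING ACUTE ACCENT

        // Approach 1: normalize both pattern and text to NFC, then match.
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true

        // Approach 2: Java's built-in (and comparatively costly) flag.
        System.out.println(Pattern.compile(composed, Pattern.CANON_EQ)
                                  .matcher(decomposed).matches()); // true
    }
}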
Re: Unicode Sets in 'Unicode Regular Expressions'
They are defined in http://unicode.org/reports/tr35/tr35.html#Unicode_Sets. We should add a pointer to that; could you please file a feedback report for #18 to that effect? Also, if you find any problems in the description in #35, you can file a ticket at http://unicode.org/cldr/trac/newticket to get them addressed. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Wed, May 28, 2014 at 12:18 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: UTS#18 'Unicode Regular Expressions' Version 17 Requirement RL1.3 'Subtraction and Intersection' talks of Unicode sets. What is the relevant definition of a 'Unicode set'? Is it a finite set of non-empty strings? Other possibilities that occur to me, depending on context, include sets of codepoints and sets of indecomposable codepoints. Richard. ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
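A sketch of the set syntax in question, assuming ICU4J (whose UnicodeSet class implements the #35 definition, including multi-character strings):

import com.ibm.icu.text.UnicodeSet;

public class UnicodeSetDemo {
    public static void main(String[] args) {
        UnicodeSet greekLower = new UnicodeSet("[[:Greek:]&[:Lowercase:]]");   // intersection
        UnicodeSet asciiNoDigits = new UnicodeSet("[[\\u0000-\\u007F]-[0-9]]"); // subtraction
        System.out.println(greekLower.contains('α'));    // true
        System.out.println(asciiNoDigits.contains('7')); // false

        // A Unicode set may also contain strings, not just code points:
        UnicodeSet withString = new UnicodeSet("[a-z {ch}]");
        System.out.println(withString.contains("ch"));   // true
    }
}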
Re: ID_Start, ID_Continue, and stability extensions
On 25 April 2014 20:53, Karl Williamson pub...@khwilliamson.com wrote: And in fact in some Unicode releases, they contained errors. I think you know this, but for others: a derived property value in the UCD is defined by the value in the derived data file, NOT by the derivation. Of course, the value might not follow the intent, just as with any other property, and there are fixes to properties, whether derived or not, in each release. And sometimes the statement of the derivation is changed, and sometimes property values are changed. And the regex recommendations in http://www.unicode.org/reports/tr18/#Compatibility_Properties are different, so you may be referring to them. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
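A sketch of reading those derived property values programmatically, assuming ICU4J; this reports what the data file says rather than re-deriving anything:

import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UProperty;

public class IdProps {
    public static void main(String[] args) {
        int[] samples = {'A', '_', '9', 0x00B7, 0x200D};
        for (int cp : samples) {
            System.out.printf("U+%04X  ID_Start=%b  ID_Continue=%b%n",
                cp,
                UCharacter.hasBinaryProperty(cp, UProperty.ID_START),
                UCharacter.hasBinaryProperty(cp, UProperty.ID_CONTINUE));
        }
    }
}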
Re: Unclear text in the UBA (UAX#9) of Unicode 6.3
We try not to do that. There are some known holes, like RBNF. If you know of others, please file a ticket. {phone} On Apr 21, 2014 9:18 PM, Doug Ewell d...@ewellic.org wrote: From: Asmus Freytag asmusf at ix dot netcom dot com wrote: In general, I heartily dislike specifications that just narrate a particular implementation... I agree completely. I see this with CLDR as well; there is a more or less implicit assumption that I will be using ICU to implement whatever is being described. I don't care how robust and well-tested a wheel is; as a developer, I should be able to use the specification to reinvent it if I like. -- Doug Ewell | Thornton, CO, USA http://ewellic.org | @DougEwell ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Updated emoji working draft
On 15 April 2014 13:14, William_J_G Overington wjgo_10...@btinternet.comwrote: If the UTC (Unicode Technical Committee) accepts the introduction of read-out labels, each read-out label both linked to a pictograph character and also linked to a language-localization text string, then that will be a far-reaching enhancement to Unicode which may have enormous implications for facilitating communication through the language barrier. If the UTC (Unicode Technical Committee) accepts the introduction of read-out labels The passage just points out that those can exist, the document does not provide any data for that. If there were on the webpage emoji for Surname, Forename, Delivery address, Card number I can't see any possible future in which emoji like that are encoded. As I said before, please move this discussion to another email subject. Otherwise, I'll take a step I should have long ago, and simply filter out all email coming from you. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Updated emoji working draft
This is really off topic. If you want to start up a thread about this, please use a different subject. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On 14 April 2014 16:01, William_J_G Overington wjgo_10...@btinternet.comwrote: Here are two examples each of a symbol together with accompanying text in Venice. The symbol is global and the text is local. https://maps.google.com/maps?q=Venice,+Italyhl=enll=45.432399,12.337928spn=0.000702,0.001124sll=37.0625,-95.677068sspn=26.039016,36.826172oq=venicehnear=Venice,+Veneto,+Italyt=mlayer=ccbll=45.432473,12.337638panoid=YazHmOmqVm1q5CZ2H7klMQcbp=12,16.36,,0,8.23z=19 Going full screen and zooming-in is helpful. William Overington 14 April 2014 ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Updated emoji working draft
On 12 April 2014 11:46, William_J_G Overington wjgo_10...@btinternet.comwrote: ... In March 2014 I published the attached document, depositing a copy with the British Library. The_format_of_the_translit.dat_file_suggested_for_possible_use_for_transliteration.pdf Is this format suitable to become standardized for use in producing localized text-to-speech from emoji to the chosen local language? no, not particularly Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Updated emoji working draft
On 12 April 2014 16:54, William_J_G Overington wjgo_10...@btinternet.comwrote: Would it be good, for an emoji that is not encoded in regular Unicode, to include mention of the possibility of transmission by markup bubble, rendered upon reception as an unmapped glyph by an OpenType colour font? For example, as nine Unicode characters. COLON COLON U1 U2 U3 U4 U5 COLON SEMICOLON This would perhaps not always allow new emoji to be added as quickly as with embedded graphics, yet with this technique, the message could be archived as plain text and would be searchable and text-to-speech would be possible at the receiving end. I don't think anything like what you suggest would be feasible, or desirable. Longer term, I think the most feasible approach is the interchange of embedded graphics, which can always have alt values (at least in html) for readings. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Bidi reordering of soft hyphen
I tend to agree with Roozbeh and Behdad. I would expect to find the visible appearance of the hyphen replacing the letters that were broken off from the last word. That is, if the word was beekeeper, I'd expect to see: bee- . That would be no matter where the word occurred, and no matter what the direction of the paragraph or surrounding text. (If the SHY occurred at a directional boundary, I'd also say we don't care much...) In any event, once we come up with an agreed recommendation, I'd suggest an implementation note like Asmus describes, but rather than talk about algorithmic steps, just point out the desired visual behavior (since there are many ways to do it). Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On 1 April 2014 23:43, Asmus Freytag asm...@ix.netcom.com wrote: I think this calls for an implementation note on UAX#9 along these lines. - During line breaking, if a line is broken at the location of a SHY, the text around the line break may change. A common case is the replacement of the invisible SHY by a visible HYPHEN, but see Section x.x in the Unicode Standard. For the purposes of the Bidi Algorithm, apply steps .. to .. after any substitutions have been made, using the directional classes for the substituted characters, instead of a single BN for the SHY character. example Note, no special action need be taken for a SHY character in the middle of a line, unless it is rendered as a visible glyph in a show-hidden-characters mode. In the latter case, the recommendation would be to treat the visible symbol substituted for the SHY as having bidi class ON. I am not sure whether -car CBA or car- CBA is the right answer, nor whether the substitution will always be limited to the preceding line. (Old-orthography German had Bäc[SHY]ker turning into Bäk-|ker, where I've used | to show the line ending.) Those are details that the UBA should be ignorant about. The important thing is that the array of bidi directional classes is not constrained to contain a single entry for BN at the location of the original SHY. If car- CBA is the right answer then the substitution would have to be HYPHEN plus LRM to get this to come out right, but that would be under the control of the line-breaking conventions, and not legislated by the UBA. A./ On 4/1/2014 1:31 PM, Whistler, Ken wrote: Richard Wordingham noted: As U+2010 HYPHEN would result in text like 'car-', in an English-influenced context I would also go with 'car-'. That's always a possibility, I suppose, but I'm not sure what English-influenced context means here. The examples I just gave were for an RTL paragraph context. In an LTR paragraph context, the same input would end up in a very different order:
Trace: Entering br_UBA_ReverseLevels [L2]
Current State: 19
Text: 05D0 05D1 05D2 0020 0063 0061 0072 002D
Bidi_Class: RRRLLLLL
Levels: 11100000
Runs: L---L
Order: [2 1 0 3 4 5 6 7]
And you get the display:
CBA car-
As opposed to:
-car CBA
In either case, the hyphen-minus (or hyphen) ends up at the *end of the line*. My take is that *if* I am going to insert a visible glyph at the point of the SHY, it would probably be best to insert it at the actual line break at the end of the line, to be in the same position as an explicit hyphen-minus with the same line break. 
--Ken ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
FYI: More emoji from Chrome
More emoji from Chrome: http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: FYI: More emoji from Chrome
Yup! Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On 1 April 2014 09:13, Philippe Verdy verd...@wanadoo.fr wrote: April 1st joke... 2014-04-01 9:01 GMT+02:00 Mark Davis ☕️ m...@macchiato.com: More emoji from Chrome: http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: Names for control characters (Was: (in 6429) in allkeys.txt)
They do have aliases in NameAliases.txt:
0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation
...
Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Wed, Mar 12, 2014 at 1:32 PM, Per Starbäck starb...@stp.lingfil.uu.se wrote: Ken Whistler wrote: Ah, I see what the interpretation problem was. Yes, that is a straightforward kind of improvement -- easily enough done. Look for a change the next time the file is updated. (It will not be immediately changed, pending other review comments.) Thanks! Then I'll skip making a formal request about this. Regarding these names in ISO 6429 again, how come these control characters don't have Unicode names? For many uses of names, the control characters have as much need for them as any other character. Since it seems so straightforward it must have been suggested several times to introduce names like CONTROL CHARACTER NULL, CONTROL CHARACTER START OF HEADING, CONTROL CHARACTER START OF TEXT, etc., so I assume there are good reasons for not doing that, but I can't see what they are. Since applications want names, they will use other things as names when there isn't a real name, and that leads to problems. Take Emacs, where the command describe-char currently describes U+0007 as name: control, old-name: BELL. (I reported the misusage of "control" here as a name in 2009, but it wasn't fixed until this year, so it is still not in a released version.) The usage of BELL here invites confusion with U+1F514 BELL. Emacs should do better regarding this, but still, with a proper name all of this would have been averted. ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
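A sketch of retrieving those aliases programmatically, assuming ICU4J; exactly which alias is returned for a given control character depends on the NameAliases.txt data, so the comments below are indicative only:

import com.ibm.icu.lang.UCharacter;

public class ControlNames {
    public static void main(String[] args) {
        int bel = 0x0007;
        // <control> characters have no Unicode character name of their own:
        System.out.println(UCharacter.getName(bel));      // null
        // ...but they do have aliases from NameAliases.txt:
        System.out.println(UCharacter.getNameAlias(bel)); // e.g. the control alias of U+0007
    }
}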
Re: NFD - NFC
Not sure about your exact case, but ICU's normalization does handle those characters. http://unicode.org/cldr/utility/transform.jsp?a=nfc%3Bhex&b=%5Cu30B9%5Cu3099 (That tool uses ICU for NFC.) Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Tue, Mar 11, 2014 at 4:50 PM, Markus Doppelbauer doppelba...@gmx.net wrote: Hello, I have another problem making the normalization process binary compatible with ICU. Why does 30B9 3099 not combine to 30BA? Steps to reproduce:
wget http://doppelbauer.name/katakana.txt
uconv -f utf8 -t utf8 -x nfd katakana.txt > ndf.txt
uconv -f utf8 -t utf8 -x nfc ndf.txt > nfc.txt
diff katakana.txt nfc.txt
Expected result: katakana.txt == nfc.txt (uconv v2.1, ICU 4.8.1.1) Thanks a lot Markus ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
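The composition in question can be checked with the JDK normalizer alone (ICU gives the same result); a minimal sketch:

import java.text.Normalizer;

public class KatakanaNfc {
    public static void main(String[] args) {
        String input = "\u30B9\u3099"; // ス + combining voiced sound mark
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);
        System.out.printf("U+%04X%n", (int) nfc.charAt(0)); // U+30BA (ズ)
        System.out.println(nfc.length());                   // 1
    }
}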
Re: Unicode organization is still anti-Serbian and anti-Macedonian
Unicode is not anti-Serbian or anti-Macedonian. The exact level of Unicode support will depend on your operating system and font choice. For example, on the Mac there are reasonable results with arbitrary accents. Here are examples with q,U+0308 and Q,U+0308: q̈ Q̈ Here is an image, in case your emailer or OS doesn't handle these well. [image: Inline image 1] See also http://www.unicode.org/standard/where/ As to the italic, that also depends on the font support on your system. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Fri, Feb 14, 2014 at 2:37 AM, Крушевљанин pe...@muchomail.com wrote: There is still a problem with the letters бгдпт in italic, and б in regular mode. OpenType support is still very weak (Firefox, LibreOffice on Linux, Adobe's software, and that's it, practically). It's also disappointing that Microsoft is still incapable of implementing and forcing this support at the system level. Also, there are Serbian/Macedonian Cyrillic vowels with accents (total: 7 types × 6 possible letters = 42 combinations) where the majority of them don't exist precomposed, and it is impossible to enter them. A lot of today's fonts (even commercial ones) still have issues with accents. In Unicode, Latin scripts are always favored, which is simply not fair to the rest of the world. They have space to put glyphs for dominoes, a lot of dead languages etc., but they don't have space for real-world issues. I want the Unicode organization to change its policies and pay attention to small countries like Serbia and Macedonia. We have real-world problems. Thank you. If you think these are my biases, I say — a real-world problem for us. If you think changes would invalidate existing texts, I say — no, because *real* Serbian/Macedonian support still doesn't exist! And we can develop converters in the future, so I don't see any huge cost problems... -- Крушевљанин Иван ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Re: CJK IDS database
Boy, I'd forgotten about those. There is an open-source collection of IDSs that I used to create those files. Unfortunately, I found that *that* data would take a lot of cleanup. I do agree that it would be very useful to have an open-source repository of IDSs for Unicode characters, but I don't know of one. Others? Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Wed, Jan 15, 2014 at 4:36 AM, Michel Suignard mic...@suignard.comwrote: I guess you should ask the owner, our distinguished president. Michel *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Andrew Pantyukhin *Sent:* Tuesday, January 14, 2014 4:06 PM *To:* unicode@unicode.org *Subject:* CJK IDS database Hi! I find Ideographic Description Sequences massively useful for studying and describing Chinese characters. However, I found only one comprehensive source of them — http://macchiato.com/ids/ Does anyone know where the files come from? Were they part of the IRG process, or just an isolated effort? What are the private use characters in the sequences? I'd like to contribute to the IDS database and incorporate it into products like wiktionary and rikaikun. ___ Unicode mailing list Unicode@unicode.org http://unicode.org/mailman/listinfo/unicode
Language Death
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0077056 with a popular article at http://www.washingtonpost.com/blogs/worldviews/wp/2013/12/04/how-the-internet-is-killing-the-worlds-languages/ The source article was interesting, although I'd take issue with some of their methodology. The WP gloss takes some liberties; in particular, the source says The latest (2012/02/28) publicly available version of the [SIL] database distinguishes 7,776 languages while the WP leaps to the conclusion that …at least 7,776 languages are in use in the greater offline world. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
Re: Best practice of using regex on identify none-ASCII email address
These are two well-known serious flaws in EAI and URLs; there is no useful syntactic limit on what is in the query part of a URL or on the local part of an email address that would allow their boundaries to be detected in plaintext. No use complaining about them, because people are concerned with backwards compatibility, and wouldn't change the underlying specs. That being true, I wish that industry could come to consensus about requiring everything outside of a well-defined, backwards-compatible set of characters to be expressed as UTF-8 percent-escaped characters in these fields when they are expressed as plaintext. (Something like XID_Continue ± exceptions.) That would allow for unambiguous parsing in plaintext. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Thu, Oct 31, 2013 at 8:37 PM, Philippe Verdy verd...@wanadoo.fr wrote: How can it surprisingly work if you need to safely embed an email address as a URI in a plain text document? Yes, there's a way to work with the IDNA part, but the local part is a challenge; making it work would require that the mail server accept several aliased account names, depending on the document in which the address was embedded and encoded before being dereferenced and used to send mails. There's no easy way to embed the local part in plain text when it can be arbitrary sequences of bytes in the non-ASCII range, whose encoding in the target domain is unpredictable without first querying the MX server for that domain for this info, or without retrying sending mails with several guesses: these guesses with retries may cause privacy issues for the legitimate owner of non-ASCII email accounts (another reason for using verification/confirmation emails with the owner, before sending him private messages). 2013/10/31 Shawn Steele shawn.ste...@microsoft.com I think that's true for non-ASCII non-EAI local parts as well. It's so inconsistent it's surprising when it works.
Re: Best practice of using regex on identify none-ASCII email address
I'm not saying that what is sent to the server has to be those bytes; I'm saying that if we use the convention that punctuation, whitespace, etc. gets escaped, it would allow us to recognize the boundaries of the local part in plain text. I think what you mention is part of a more general problem. Let's suppose that I have an email address where the bytes that the server recognizes for the local part are 61 B3@foo.com. I convert that using ISO 8859-14 to aġ@foo.com. I send it in an email to you, and you receive it as UTF-8. You see aġ@foo.com, but underneath the covers it is bytes 61 C4 A1. But then you send to the server 61 C4 A1@foo.com, and it fails. Or worse yet, it reaches someone whose email is aÄ¡@foo.com. (Ok, I could have poked around and found a more compelling example, but you see the point.) If I really wanted to be absolutely certain that my email wouldn't be munged by a conversion, I'd never convert from bytes: we'd never see m...@foo.com, we'd always see the equivalent of %6d%61%72...@foo.com. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Fri, Nov 1, 2013 at 1:36 PM, Philippe Verdy verd...@wanadoo.fr wrote: 2013/11/1 Mark Davis ☕ m...@macchiato.com These are two well-known serious flaws in EAI and URLs; there is no useful syntactic limit on what is in the query part of a URL or on the local part of an email address that would allow their boundaries to be detected in plaintext. No use complaining about them, because people are concerned with backwards compatibility, and wouldn't change the underlying specs. That being true, I wish that industry could come to consensus about requiring everything outside of a well-defined, backwards-compatible set of characters to be expressed as UTF-8 percent-escaped characters in these fields when they are expressed as plaintext. (Something like XID_Continue ± exceptions.) That would allow for unambiguous parsing in plaintext. Why UTF-8 only? There already exist email accounts created with various ISO 8859-* or Windows codepages, or KOI8-R (or KOI8-U). And none of these addresses are aliased with a UTF-8 encoded account name reaching the same mailbox (creating these aliases would help these users having such accounts to protect their privacy; however, there may exist rare cases where these aliases would conflict with distinct mail accounts
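A sketch of the percent-escaping convention proposed above, in Java; the safe set here is illustrative only, not the XID_Continue-based set suggested in the thread:

import java.nio.charset.StandardCharsets;

public class LocalPartEscape {
    static String escape(String localPart) {
        StringBuilder sb = new StringBuilder();
        for (byte b : localPart.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean safe = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                        || (v >= '0' && v <= '9') || v == '.' || v == '-' || v == '_';
            sb.append(safe ? String.valueOf((char) v) : String.format("%%%02X", v));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("mark"));    // mark
        System.out.println(escape("a\u0121")); // a%C4%A1 (the 61 C4 A1 bytes from the example)
    }
}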
Re: full-width Latin missing from confusables data
FYI, I just submitted a doc to the UTC for the upcoming meeting: #36 #39 Recommendations http://goo.gl/NKeRVB If there is any feedback you'd like me to incorporate in a revision before the meeting, please let me know. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Tue, Oct 15, 2013 at 8:53 PM, Mark Davis ☕ m...@macchiato.com wrote: but as Michel mentioned the data does not seem consistent in that case. You might add that to your report... Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Tue, Oct 15, 2013 at 7:23 PM, Chris Weber ch...@lookout.net wrote: On 10/14/2013 12:40 AM, Mark Davis ☕ wrote: For the confusables, the presumption is that implementations have already either normalized the input to NFKC or have rejected input that is not NFKC. Thanks for the explanation Mark. It makes sense for implementations which want to detect confusability, but as Michel mentioned the data does not seem consistent in that case. Another case could be implementations which want to generate confusable strings for testing - do you think those could be improved by having this extra data? For example: http://unicode.org/cldr/utility/confusables.jsp?a=m&r=None It would probably be worth clarifying this in the text of http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an upcoming UTC meeting at the start of Nov., so if you want to suggest that or any other improvements, you should use http://www.unicode.org/reporting.html. Thank you, I'll file a report. -- Best regards, Chris Weber - ch...@lookout.net - http://www.lookout.net PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7
Re: Terminology question re ASCII
Normally the term ASCII just refers to the 7-bit form. What is sometimes called 8-bit ASCII is the same as ISO Latin 1. If you want to be completely clear, you can say 7-bit ASCII. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Tue, Oct 29, 2013 at 5:12 AM, d...@bisharat.net wrote: Quick question on terminology use concerning a legacy encoding: If one refers to plain ASCII, or plain ASCII text or ... characters, should this be taken strictly as referring to the 7-bit basic characters, or might it encompass characters that might appear in an 8-bit character set (per the so-called extended ASCII)? I've always used the term ASCII in the 7-bit, 128-character sense, and modifying it with plain seems to reinforce that sense. (Although plain text in my understanding actually refers to lack of formatting.) The reason for asking is encountering a reference to plain ASCII describing text that clearly (by the presence of accented characters) would be 8-bit. The context is one of many situations where, in attaching a document to an email, it is advisable to include an unformatted text version of the document in the body of the email. Never mind that the latter is probably in UTF-8 anyway(?) - the issue here is the terminology. TIA for any feedback. Don Osborn Sent via BlackBerry by AT&T
Re: full-width Latin missing from confusables data
but as Michel mentioned the data does not seem consistent in that case. You might add that to your report... Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Tue, Oct 15, 2013 at 7:23 PM, Chris Weber ch...@lookout.net wrote: On 10/14/2013 12:40 AM, Mark Davis ☕ wrote: For the confusables, the presumption is that implementations have already either normalized the input to NFKC or have rejected input that is not NFKC. Thanks for the explanation Mark. It makes sense for implementations which want to detect confusability, but as Michel mentioned the data does not seem consistent in that case. Another case could be implementations which want to generate confusable strings for testing - do you think those could be improved by having this extra data? For example: http://unicode.org/cldr/utility/confusables.jsp?a=m&r=None It would probably be worth clarifying this in the text of http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an upcoming UTC meeting at the start of Nov., so if you want to suggest that or any other improvements, you should use http://www.unicode.org/reporting.html. Thank you, I'll file a report. -- Best regards, Chris Weber - ch...@lookout.net - http://www.lookout.net PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7
Re: full-width Latin missing from confusables data
For the confusables, the presumption is that implementations have already either normalized the input to NFKC or have rejected input that is not NFKC. More broadly, in gathering data the main emphasis is on characters that fit the profile in http://www.unicode.org/reports/tr39/#Identifier_Characters, including scripts like Cyrillic ( http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts). So while we do add characters outside of that, there has been no concerted effort to do so. In particular, in your identifiers you should not allow scripts like Buginese ( http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Exclusion_from_Identifiers) or Lisu (http://www.unicode.org/reports/tr31/#Table_Limited_Use_Scripts) without recognizing that the confusable data will be sketchy for those. It would probably be worth clarifying this in the text of http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an upcoming UTC meeting at the start of Nov., so if you want to suggest that or any other improvements, you should use http://www.unicode.org/reporting.html. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Sun, Oct 13, 2013 at 7:36 PM, Chris Weber ch...@lookout.net wrote: While looking closer at the current confusables data, I've noticed that several of the fullwidth code points seem to be missing from the confusables data. For example, U+FF4D FULLWIDTH LATIN SMALL LETTER M does not exist as a confusable for U+006D LATIN SMALL LETTER M, as well as several others I've noticed. Was this intentional? Also, I'm not clear on the difference between confusables.txt and confusablesSummary.txt - are these meant to provide the same data in different formats? -- Best regards, Chris Weber - ch...@lookout.net - http://www.lookout.net PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7
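A sketch of skeleton-based confusability testing, assuming a recent ICU4J (whose SpoofChecker is built from the confusables.txt data discussed above); two strings are confusable when their skeletons are equal:

import com.ibm.icu.text.SpoofChecker;

public class Skeletons {
    public static void main(String[] args) {
        SpoofChecker sc = new SpoofChecker.Builder().build();
        String latin = "paypal";           // all ASCII
        String mixed = "p\u0430yp\u0430l"; // with CYRILLIC SMALL LETTER A
        System.out.println(sc.getSkeleton(latin));
        System.out.println(sc.getSkeleton(mixed));
        System.out.println(sc.getSkeleton(latin).equals(sc.getSkeleton(mixed))); // true
    }
}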
Re: More additional Greek (and Hebrew) characters needed for proposal
http://www.unicode.org/faq/char_combmark.html#9 and following. Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Sat, Sep 21, 2013 at 7:38 PM, Robert Wheelock rwhlk...@gmail.com wrote: Hello again, y’all! I’ve got quite a few characters (currently missing) that DO need proposal for inclusion! I typed up a document (for the new Fontboard polytonic Greek/Coptic keyboard layouts) that list the Unicode hexidecimal numerical values for the polytonic/monotonic Greek precomposed characters, and found out that (at least) 17 vowel/accent combos are still missing: H-C IOTA and UPSILON with both DIALYTIKA and ACCENTS (8 precomposed characters) H-C ALPHA, ĒTA, and ŌMEGA with both PROSGEGRAMMENĒ and ACCENTS (9 precomposed characters). Besides those, there’re accented consonants that also need encoding—ZĒTA and SIGMA with DIALYTIKA (H-C/L-C), GAMMA with TILDAS, GAMMA; KAPPA; and KHI with OVERDOT, KAPPA; PI; TAU with TILDAS, LAMBDA; MU; NU with both PSILI and DASEIA, LAMBDA; MU; NU; and RHŌ with UNDERRING, ... . As far as Hebrew is concerned, we NEED these new characters encoded: WAW with a TRUE SHURUQ (the inner dot positioned a bit higher than a DAGHESH or a MAPPIQ) The same (above mentioned) WAW-TRUE SHURUQ with a DAGHESH added WAW with both a ḪOLAM atop and a DAGHESH inside Doubly-pointed SHIN letters—a plain one + one with a DAGHESH added MEM SOFITH with a right-positioned ḪIRIQ ḪAṬAFOTH vowel points—each with SILLUQ/METHEGH interjected within KHAF SOFITH and FEʾ SOFITH with RAFEH (especially for Yiddish) GHIMEL; DHALETH; and THAW with RAFEH CHIMEL; ĹAMEDH; and ÑUN with VARIQAʾ (especially for Ladino) BENT LAMEDH—plain, with ḪOLAM, with DAGHESH, and with both DAGHESH + ḪOLAM YUDH-WAW ligature GALGAL HAFUKH accent (especially for Yiddish) GIMEL; DALETH; ZAYIN; ṬETH; LAMEDH; NUN; SAMEKH; ʿAYIN; and REʾSH with GALGAL HAFUKH (for Yiddish palatal consonanats and the /e/ vowel sound) An assortment of letters with top dot configurations—single, double horizontal, triple up-triangular, and quadruple squared—for the typography required for miscellaneous Jewish languages, as these top-dotted letters are intended to imitate the ʾIJAM dots in the corresponding Arabic letters The Palestinian, Babylonian, and Yemenite systems of vowel pointing and cantillation. Please find the .PDF document on the polytonic Greek character codepoint listings; I’ll need to finish—and publish—a similar publication for Hebrew characters. Thank You!
Re: Code point vs. scalar value
Nicely stated. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Thu, Sep 19, 2013 at 11:21 PM, Whistler, Ken ken.whist...@sap.com wrote: Stephan Stiller seems unconvinced by the various attempts to explain the situation. Perhaps an authoritative explanation of the textual history might assist. Stephan demands an answer: I want to know why the Glossary claims that surrogate code points are [r]eserved for use by UTF-16. Reason #1 (historical): Because the Glossary entry for “Surrogate Code Point” has been worded thusly since Unicode 4.0 (p. 1377), published in 2003, and hasn’t been reworded since. Reason #2 (substantive): Because UTC members have been satisfied with the content of the statement and have not required it be changed in subsequent versions of the standard. Reason #3 (intentional): Because the wording was added in the first place as part of the change to identify the term “surrogate character”, which had been widely used before, as a misnomer and a usage to be deprecated. The term “surrogate code point” was a deliberate introduction at that time to refer specifically to the range U+D800..U+DFFF of “code points” which could *not* be used to encode abstract characters. Reason #4 (proximal): Because nobody recently has submitted a suggested improvement to the text of the relevant entry in the glossary (and associated text in Chapter 3) which has passed muster in the editorial committee and been considered to be an improvement on the text. If it is exegesis rather than textual history that concerns you, here is what I consider to be a full explanation of the meaning of the text that troubles you so: Code points in the range U+D800..U+DFFF are reserved for a special purpose, and cannot be used to encode abstract characters (thereby making them encoded characters) in the Unicode Standard. Note that it is perfectly valid to refer to these as code points and use the U+ prefix for them. The U+ prefix identifies the Unicode codespace, and the glossary (correctly) identifies that as the range of integers from 0 to 0x10FFFF. O.k., if the range of code points U+D800..U+DFFF is reserved for a special purpose, what is that purpose and how do we designate the range? The designation is easy: we call elements of the subrange U+D800..U+DBFF “high-surrogate code point” (see D71) and the elements of the subrange U+DC00..U+DFFF “low-surrogate code point” (see D73), and by construction (and common usage), the elements contained in the union of those two subranges are called “surrogate code point”. What is the special purpose? The shorthand description of the purpose is that the “surrogate code points” are “used for UTF-16”. But since that seems to confuse a minority of the readers of the standard, here is a longer explication: The surrogate code points are deliberately precluded from use to encode abstract characters to enable the construction of an efficient and unambiguous mapping between Unicode scalar values (the U+0000..U+D7FF, U+E000..U+FFFF, and U+10000..U+10FFFF subranges of the Unicode codespace) and the sequences of 16-bit code units defined in the UTF-16 encoding form. In other words, the reservation *from* encoding for the code points U+D800..U+DFFF enables the use of the numerical range 0xD800..0xDFFF to define surrogate pairs to map U+10000..U+10FFFF, while otherwise retaining a simple one-to-one mapping from code point to code unit in UTF-16 for the BMP code points which *are* used for encoding abstract characters. 
In short, the surrogate code points are “used for UTF-16”. Stephan’s next demand for an answer was: Remind me real quick, in what way does a function use the input values that it's not defined on? Well, the problem here is in the formulation of the implied question. I suspect, from the discussion in this thread, that Stephan has concluded that the generic wording “used for” in the glossary item in question necessarily imputes that the surrogate code points are therefore elements of the domain of the mapping function for UTF-16 (which maps Unicode scalar values to sequences of UTF-16 code units). Of course that imputation is incorrect. Surrogate code points are excluded from that domain, by *definition*, as intended. And I have explained above what the phrase “used for” is actually used for in the glossary entry. Finally: And what does this have to do with UTF-16? It is definitional for UTF-16. I think that should also be clear from the explanation above. Now, rather than quibbling further about what the glossary says, if the explanation still does not satisfy, and if the text in the glossary (and in Chapter 3) still seems wrong and misleading in some way, here is a more productive way forward:
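The scalar-value-to-surrogate-pair arithmetic that this mapping enables can be written out directly; a minimal sketch in Java (names are illustrative):

public class SurrogateMath {
    public static void main(String[] args) {
        int cp = 0x12345; // CUNEIFORM SIGN URU TIMES KI
        int lead  = 0xD800 + ((cp - 0x10000) >> 10);
        int trail = 0xDC00 + ((cp - 0x10000) & 0x3FF);
        System.out.printf("U+%04X -> %04X %04X%n", cp, lead, trail); // D808 DF45

        int back = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
        System.out.printf("back: U+%04X%n", back); // U+12345
    }
}

Because 0xD800..0xDFFF can never encode a character, a decoder seeing a code unit in that range knows unambiguously that it is half of a pair rather than a BMP character, which is exactly the reservation being described.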
Re: Draft of LDML Specification for CLDR release 24
Thanks for the feedback; the typo is fixed. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Fri, Sep 13, 2013 at 1:19 AM, Philippe Verdy verd...@wanadoo.fr wrote: Typo in section 2.3 Number Symbols, for the new item superscriptingExponent, which describes: The superscripting can use markup, such as <sup>4</sub> in HTML, (...) Of course this should be <sup>4</sup>. 2013/9/13 John Emmons e...@us.ibm.com CLDR v24 is scheduled to be released next week (2013-09-18). While the LDML specification (http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html) and release note (http://cldr.unicode.org/index/downloads/cldr-24) are still being worked on, we'd welcome feedback on any major problems in the text. A summary of the changes to the specification can be found at: http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#Modifications Regards, John C. Emmons Globalization Architect Unicode CLDR TC Chairman IBM Software Group Internet: e...@us.ibm.com
Re: polytonic Greek: diacritics above long vowels ᾱ, ῑ, ῡ
Classical Greek might qualify [for a CLDR entry] It certainly qualifies, but we require that a submitter commit to collecting a minimal amount of data before we add it. See http://cldr.unicode.org/index/cldr-spec/minimaldata Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Mon, Aug 5, 2013 at 3:58 PM, Stephan Stiller stephan.stil...@gmail.comwrote: On 8/5/2013 11:26 AM, Whistler, Ken wrote: Inclusion of the precomposed characters now seen in the U+1FXX block was part of the price of the merger. What was included was precisely the repertoire requested by Greece, and no attempt was made to further rationalize forms including macrons for Ancient Greek. Thanks, Ken. It's good to know that there is no other reason. Partial credit goes to Tom Gewecke who had pointed me off-list to http://www.tlg.uci.edu/~opoudjis/unicode/ken_adscripts.html and the fact that the precomposed set from ISO 10646 can be traced back to ELOT (ΕΛΟΤ). On 8/5/2013 1:25 PM, Richard Wordingham wrote: Classical Greek might qualify [for a CLDR entry] Yes or no, and I have in fact no(t yet an) opinion on the necessity thereof – it's a different question from the one to what extent D matters for A *if* A had an entry, but I think we're on the same page at this point: On 8/5/2013 1:25 PM, Richard Wordingham wrote: However, if vowels with macrons had made it into D, then one would expect them in A. Yep, I agree. A loose analogy and one sensible view (which is in fact compatible with yours) is that it's imaginable for say a lexicographer for English to have some version of Cyrillic letters available for typesetting but defensible for him to not have/use stress marks, whereas any Cyrillic typesetting engine within a Cyrillic locale should be able to provide them. This made-up example is imperfect, but it might help someone understand the thread. That said, I have not yet formed an opinion on whether a font intended for a Modern Greek locale should be able to render ᾱ, ῑ, ῡ with additional diacritics. (One intended for Ancient Greek should, I think.) Stephan
Re: Behdad Esfahbod won an O'Reilly Open Source Award!
Great news, and well deserved! Congratulations, Behdad! Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Mon, Jul 29, 2013 at 9:41 PM, Roozbeh Pournader rooz...@google.com wrote: Some of you probably have heard the news already, but in case you haven't, Behdad won the prestigious O'Reilly Open Source Award, announced last Friday. Here's the announcement: http://www.oscon.com/oscon2013/public/schedule/detail/29956 Selected quotes: The O’Reilly Open Source Awards recognize individual contributors who have demonstrated exceptional leadership, creativity, and collaboration in the development of Open Source Software. [...] *Behdad Esfahbod (HarfBuzz):* Through the HarfBuzz project Behdad is working relentlessly to get all languages supported in Free Software operating systems, word processors, devices and browsers, no matter how complex their scripts are. I wish to congratulate Behdad for his achievements, which have really helped make open source way more accessible to billions of users around the world. I'm eagerly waiting for his amazing magic and superhacker skills to bear even more fruit over the years to come. I'm proud to have been able to call him a friend, colleague, and collaborator for more than fifteen years now. Roozbeh
Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?
Popping up a level: ICU (and some other libraries) has heuristic encoding detection that will take a sequence of bytes and come up with a likely encoding ID. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Fri, Jul 19, 2013 at 8:40 PM, Whistler, Ken ken.whist...@sap.com wrote: Suppose that these hex bytes: C3 83 C2 B1 show up in a message and the message contains no hint what its encoding is. Perhaps it is 8859-1, in which case the message consists of four 1-byte characters: C3 = Ã 83 = the “no break here” character C2 = Â B1 = ± Perhaps it is UTF-8, in which case the message consists of two 2-byte characters: C383 = 쎃 C2B1 = 슱 Actually, that would be interpreting it as UTF-16, not as UTF-8. That can probably be quickly ruled out if the rest of the text is not obviously in UTF-16. Interpreted as UTF-8, it would be: C3 83 -- U+00C3 = Ã C2 B1 -- U+00B1 = ± More likely than the other two alternatives you cite. Of course, you also have to consider serial corruptions as a possibility. It could have started out as UTF-8 C3 B1 -- U+00F1 = ñ. Then the C3 B1 got misinterpreted as Latin-1, and then re-misinterpreted as UTF-8 again. --Ken
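A minimal Java sketch (using only the JDK's built-in charsets; ICU's actual detector lives in com.ibm.icu.text.CharsetDetector) showing how the same four bytes decode under the three interpretations discussed above:

    import java.nio.charset.StandardCharsets;

    public class EncodingGuess {
        public static void main(String[] args) {
            byte[] bytes = { (byte) 0xC3, (byte) 0x83, (byte) 0xC2, (byte) 0xB1 };
            // As ISO 8859-1: four one-byte characters (Ã, the 0x83 control, Â, ±)
            System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
            // As UTF-8: two two-byte sequences, U+00C3 and U+00B1 ("Ã±")
            System.out.println(new String(bytes, StandardCharsets.UTF_8));
            // As UTF-16BE: two code units, U+C383 and U+C2B1 ("쎃슱")
            System.out.println(new String(bytes, StandardCharsets.UTF_16BE));
        }
    }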
Re: The skywriter we hired has terrible Unicode support
Saw that, thanks! Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Wed, May 8, 2013 at 8:26 PM, Tim Greenwood timo...@greenwood.name wrote: http://xkcd.com/1209/
RE: Encoding localizable sentences (was: RE: UTC Document Register Now Public)
LOL... {phone} On Apr 20, 2013 8:44 PM, Erkki I Kolehmainen e...@iki.fi wrote: Mr. Overington, I'm sorry to have to admit that I cannot follow at all your train of thought on what would be the practical value of localizable sentences in any of the forms that you are contemplating. In my mind, they would not appear to broaden the understanding between different cultures (and languages), quite the contrary. I appreciate the fact that there are several respectable members of this community who are far too polite to state bluntly what they think of the technical merits of your proposal. Sincerely, Erkki I. Kolehmainen -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of William_J_G Overington Sent: 20 April 2013 12:39 To: KenWhistler Cc: unicode@unicode.org; KenWhistler; wjgo_10...@btinternet.com Subject: Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public) On Friday 19 April 2013, Whistler, Ken ken.whist...@sap.com wrote: You are aware of Google Translate, for example, right? Yes. I use it from time to time, mostly to translate into English: it is very helpful. If you input sentences such as those in your scenarios or the other examples, such as: Where can I buy a vegetarian meal with no gluten-containing ingredients in it please? You can get immediately serviceable and understandable translations in dozens of languages. For example: Wo kann ich ein vegetarisches Essen ohne Gluten-haltigen Bestandteile davon, bitte? Not perfect, perhaps, but perfectly comprehensible. And the application will even do a very decent job of text to speech for you. I am not a linguist and I know literally almost no German, so I am not able to assess the translation quality of sentences. Perhaps someone on this list who is a native speaker of German might comment please. I am thinking that the fact that I am not a linguist and that I am implicitly seeking the precision of mathematics and seeking provenance of a translation is perhaps the explanation of why I am thinking that localizable sentences is the way forward. There seems to be a fundamental mismatch deep in human culture between the way that mathematics works precisely and the way that translation often conveys an impression of meaning that is not congruently exact. Perhaps that is a factor in all of this. Thank you for your reply and for taking the time to look through the simulations and for commenting. Having read what you have written and having thought about it for a while, I am wondering whether it would be a good idea for there to be a list of numbered preset sentences that are an international standard, and then, if Google chose to front-end Google Translate with precise translations of that list of sentences made by professional linguists who are native speakers, there could be a system that can produce a translation that is precise for the sentences that are on the list and machine-translated for everything else. Maybe there could then just be two special Unicode characters, one to indicate that the number of a preset sentence is to follow and one to indicate that the number has finished. In that way, text and localizable sentences could still be intermixed in a plain text message. For me, the concept of being able to mix text and localizable sentences in a plain text message is important.
Having two special characters of international standard provenance for denoting a localizable sentence markup bubble unambiguously in a plain text document could provide an exact platform. If a software package that can handle automated localization were active, then it could replace the sequence with the text of the sentence localized into the local language: otherwise the open localizable sentence bubble symbol, some digits and the close localizable sentence bubble symbol would be displayed. If that were the case then there might well not be symbols for the sentences, yet the precise conveying of messages as envisaged in the simulations would still be achievable. Perhaps that is the way forward for some aspects of communication through the language barrier. Another possibility would be to have just a few localizable sentences with symbols as individual characters and to have quite a lot of numbered sentences using a localizable sentence markup bubble, and then everything else by machine translation. I shall try to think some more about this. At any rate, if Margaret Gattenford and her niece are still stuck at their hotel and the snow is blocking the railway line, my suggestion would be that Margaret whip out her mobile phone. And if she doesn't have one, perhaps her niece will lend hers to Margaret. Well, they were still staying at the hotel some time ago. They feature in locse027_simulation_five.pdf available from the following post.
Re: Rendering Raised FULL STOP between Digits
Should the Unicode Consortium decide to recommend an existing (or new) character as a raised decimal for numbers, we would add that to CLDR, and recommend that implementations accept either one as equivalent when parsing. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Sun, Mar 10, 2013 at 10:39 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: On Sat, 9 Mar 2013 18:58:45 -0700 Doug Ewell d...@ewellic.org wrote: Richard Wordingham wrote: The general feeling seems to be that computers don't do proper decimal points, and so the raised decimal point is dropping out of use. Any discussion of whether computers handle decimal points properly can't happen without talking about number-to-string conversion routines in programming languages and frameworks. The question is what users will demand. Expectations have been low enough that the loss of decimal points has been accepted. Additionally, striving for an apparently hard-to-get raised decimal point risks being forced to use an achievable decimal comma. Conversion routines are often able to choose between full stop and comma as the decimal separator, based on locale, but I'm not aware of any that will use U+00B7. The same is true for using U+2212, or even U+2013, as the negative sign instead of U+002D, which looks just terrible for this purpose in many fonts. U+2212 is not necessary for English (see CLDR exemplar characters), so CLDR policy (if not rules) does not allow it in CLDR conversion rules. I'm feeling lucky that I've got away with using it in documents for a few years now, but maybe I've only succeeded because we've been cutting and pasting from a Unicode-aware environment (Windows) to an 8-bit environment (ill-maintained Solaris, hated by management). Richard.
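Standard formatting routines will not pick U+00B7 on their own, but most let the separator be overridden by hand. A minimal Java sketch (an illustration only, not a CLDR recommendation):

    import java.text.DecimalFormat;
    import java.text.DecimalFormatSymbols;
    import java.util.Locale;

    DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.UK);
    symbols.setDecimalSeparator('\u00B7'); // MIDDLE DOT standing in as a raised decimal point
    DecimalFormat format = new DecimalFormat("#,##0.###", symbols);
    System.out.println(format.format(3.14159)); // prints 3·142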
Re: JSON version of CLDR
I think just the main data is converted. If you want to request the other data you can file a CLDR ticket. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Sat, Mar 2, 2013 at 8:35 PM, Edwin Hoogerbeets ehoogerbe...@gmail.com wrote: Hi all, I am trying to find the CLDR collation tailoring and DUCET data in JSON format. I looked at the CLDR data published for release 22.1 (http://www.unicode.org/repos/cldr-aux/json/22.1/), but it doesn't seem to be there. Is this the right place to look for that? (Is it even converted to JSON format yet?) Thanks, Edwin
Re: What does it mean to not be a valid string in Unicode?
But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is non-conformant, you have to be very clear what it is non-conformant *TO* . Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). - That *is* conformant for *Unicode 16-bit strings.* - That is *not* conformant for *UTF-16*. There is an important difference. Mark https://plus.google.com/114199149796022210033 * * *— Il meglio è l’inimico del bene —* ** On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote: But still non-conformant.
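A minimal Java sketch of the difference: an unpaired surrogate is fine in a Unicode 16-bit string (which is what a Java String is), but it cannot survive conversion to any UTF.

    import java.nio.charset.StandardCharsets;

    String s = "a\uD800b"; // unpaired high surrogate: a valid Unicode 16-bit string
    System.out.println(s.length()); // 3 — Java stores and processes it without complaint
    // ...but it is not well-formed UTF-16, so no encoding form can represent it:
    System.out.println(StandardCharsets.UTF_16BE.newEncoder().canEncode(s)); // false
    // String.getBytes substitutes the encoder's default replacement on conversion:
    System.out.println(new String(s.getBytes(StandardCharsets.UTF_8),
                                  StandardCharsets.UTF_8)); // "a?b"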
Re: Are there Unicode processors?
That is not the typical way that Unicode text is processed. Typically whatever OS you are using will supply mechanisms for iterating through any Unicode string, returning each of the code points. It may also offer APIs for returning information about each character (called 'property values'), or you can get libraries like ICU (http://site.icu-project.org/) that have full-featured property support (http://userguide.icu-project.org/strings/properties). Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Mon, Jan 7, 2013 at 2:34 PM, Costello, Roger L. coste...@mitre.org wrote: Hi Folks, An XML processor breaks up an XML document into its parts -- here's a start tag, here's element content, here's an end tag, etc. -- and then makes those parts (along with information about each part, such as this part is a start tag and this part is element content) available to XML applications via an API. Are there Unicode processors? That is, are there processors that break up Unicode text into its parts -- here's a character, here's another character, here's still another character, etc. -- and then make those parts (along with information about each part, such as this part is the Latin Capital Letter T and this part is the Latin Small Letter o) available to Unicode applications (such as XML processors) via an API? I did a Google search for Unicode processor and came up empty, so I am guessing the answer is that there are no Unicode processors. Or perhaps they go by a different name? If there are no Unicode processors, why not? /Roger
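In Java, for instance, the iteration Roger describes is a one-liner over code points, and Character.getName (Java 7 and later) supplies the formal name; a minimal sketch:

    "To".codePoints().forEach(cp ->
        System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
    // U+0054 LATIN CAPITAL LETTER T
    // U+006F LATIN SMALL LETTER O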
Re: What does it mean to not be a valid string in Unicode?
That's not the point (see successive messages). Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Mon, Jan 7, 2013 at 4:59 PM, Martin J. Dürst due...@it.aoyama.ac.jp wrote: On 2013/01/08 3:27, Markus Scherer wrote: Also, we commonly read code points from 16-bit Unicode strings, and unpaired surrogates are returned as themselves and treated as such (e.g., in collation). That would not be well-formed UTF-16, but it's generally harmless in text processing. Things like this are called garbage in, garbage out (GIGO). It may be harmless, or it may hurt you later. Regards, Martin.
Re: What does it mean to not be a valid string in Unicode?
In practice and by design, treating isolated surrogates the same as reserved code points in processing, and then cleaning up on conversion to UTFs, works just fine. It is a tradeoff that is up to the implementation. It has nothing to do with a legacy of C pointer arithmetic. It does represent a pragmatic choice made some time ago, but there is no need to get worked up about it. Human scripts and their representation on computers are quite complex enough; in the grand scheme of things the handling of surrogates in implementations pales in significance. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Mon, Jan 7, 2013 at 9:43 PM, Stephan Stiller stephan.stil...@gmail.com wrote: Things like this are called garbage in, garbage-out (GIGO). It may be harmless, or it may hurt you later. So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-) Wouldn't the clean way be to ensure valid strings (only) when they're built and then make sure that string algorithms (only) preserve well-formedness of input? Perhaps this is how the system grew, but it seems to me that it's yet another legacy of C pointer arithmetic and about convenience of implementation rather than a safety or performance issue. Stephan
Re: What does it mean to not be a valid string in Unicode?
Some of this is simply historical: had Unicode been designed from the start with 8- and 16-bit forms in mind, some of this could be avoided. But that is water long under the bridge. Here is a simple example of why we have both UTFs and Unicode Strings. Java uses Unicode 16-bit Strings. The following code copies all the code units from string to buffer.

    StringBuilder buffer = new StringBuilder();
    for (int i = 0; i < string.length(); ++i) {
        buffer.append(string.charAt(i));
    }

If Java always enforced well-formedness of strings, then 1. The above code would break, since there is an intermediate step where buffer is ill-formed (when just the first of a surrogate pair has been copied). 2. It would involve extra checks in all of the low-level string code, with some impact on performance. Newer implementations of strings, such as Python's, can avoid these issues because they use a Uniform Model, always dealing in code points. For more information, see also http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html (There are many, many discussions of this in the Unicode email archives if you have more questions.) Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Sat, Jan 5, 2013 at 11:14 PM, Stephan Stiller stephan.stil...@gmail.com wrote: If for example I sit on a committee that devises a new encoding form, I would need to be concerned with the question which *sequences of Unicode code points* are sound. If this is the same as sequences of Unicode scalar values, I would need to exclude surrogates, if I read the standard correctly (this wasn't obvious to me on first inspection btw). If for example I sit on a committee that designs an optimized compression algorithm for Unicode strings (yep, I do know about SCSU), I might want to first convert them to some canonical internal form (say, my array of non-negative integers). If surrogate values can be assumed to not exist, there are 2048 fewer values a code point can assume; that's good for compression, and I'll subtract 2048 from those large scalar values in a first step. Etc etc. So I do think there are a number of very general use cases where this question arises. In fact, these questions have arisen in the past and have found answers then. A present-day use case is if I author a programming language and need to decide which values for val I accept in a statement like this: someEncodingFormIndependentUnicodeStringType str = val, specified in some PL-specific way I've looked at the Standard, and I must admit I'm a bit perplexed. Because of C1, which explicitly states A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. I do not know why surrogate values are defined as code points in the first place. It seems to me that surrogates are (or should be) an encoding form–specific notion, whereas I have always thought of code points as encoding form–independent. Turns out this was wrong. I have always been thinking that code point conceptually meant Unicode scalar value, which is explicitly forbidden to have a surrogate value. Is this only terminological confusion? I would like to ask: Why do we need the notion of a surrogate code point; why isn't the notion of surrogate code units [in some specific encoding form] enough? Conceptually surrogate values are byte sequences used in encoding forms (modulo endianness).
Why would one define an expression (Unicode code point) that conceptually lumps Unicode scalar value (an encoding form–independent notion) and surrogate code point (a notion that I wouldn't expect to exist outside of specific encoding forms) together? An encoding form maps only Unicode scalar values (that is, all Unicode code points excluding the surrogate code points), by definition. D80 and what follows (Unicode string and Unicode X-bit string) exist, as I understand it, *only* in order for us to be able to have terminology for discussing ill-formed code unit sequences in the various encoding forms; but all of this talk seems to me to be encoding form–dependent. I think the answer to the question I had in mind is that the legal sequences of Unicode scalar values are (by definition) ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})*. But then there is the notion of Unicode string, which is conceptually different, by definition. Maybe this is a terminological issue only. But is there an expression in the Standard that is defined as sequence of Unicode scalar values, a notion that seems to me to be conceptually important? I can see that the Standard defines the various well-formed encoding form code unit sequence. Have I overlooked something? Why is it even possible to store a surrogate value in something like the icu::UnicodeString datatype? In other words, why are we concerned with storing Unicode *code points* in data structures instead
Re: If X sorts before Y, then XZ sorts before YZ ... example of where that's not true?
There are many cases of such digraphs. Example from Slovak: c < d < h, but cd < h < ch. Cf. http://www.unicode.org/reports/tr10/, searching for Slovak. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Sun, Jan 6, 2013 at 1:56 PM, Costello, Roger L. coste...@mitre.org wrote: Hi Folks, In the book Unicode Demystified (p. xxii) it says: An English-speaking programmer might assume, for example, that given the three characters X, Y, and Z, that if X sorts before Y, then XZ sorts before YZ. This works for English, but fails for many languages. Would you give an example of where character 1 sorts before character 2 but character 1, character 3 does not sort before character 2, character 3? /Roger
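The Slovak digraph can be checked directly with a tailored collator; a minimal Java sketch (assuming the JDK's Slovak collation data carries the CLDR tailoring of ch after h):

    import java.text.Collator;
    import java.util.Locale;

    Collator slovak = Collator.getInstance(new Locale("sk"));
    System.out.println(slovak.compare("c", "h") < 0);  // true: c sorts before h
    System.out.println(slovak.compare("ch", "h") > 0); // true: the digraph ch sorts after h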
Re: holes (unassigned code points) in the code charts
http://www.unicode.org/alloc/CurrentAllocaiton.html = http://www.unicode.org/alloc/CurrentAllocation.html Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Fri, Jan 4, 2013 at 10:24 AM, Whistler, Ken ken.whist...@sap.com wrote: Stephan Stiller continued: Occasionally the question is asked how many characters Unicode has. This question has an answer in section D.1 of the Unicode Standard. I suspect, however, that once in a while the motivation for asking this question is to find out how much of Unicode has been used up. As filling in holes would be dispreferred, it might be interesting to know how much of Unicode has been filled if one counts partially filled blocks as full. I have no reason to disagree with the (stated and reiterated) opinion that our codespace won't be used up in the foreseeable future, but it's simply a fun question to ask. The editors maintain some statistical information relevant to this fun question at: http://www.unicode.org/alloc/CurrentAllocaiton.html Feel free to reference those fun facts the next time Unicode comes up in conversation at the bar. ;-) There have been a few notable examples where particularly egregious examples of holes in blocks that seemed unlikely to be filled with like material in the future were “reprogrammed”, as it were, and grabbed for the encoding of unrelated sets of characters. The most notable of these is the range U+FDD0..U+FDEF in the middle of the Arabic Presentation Forms-A block. There was a clear consensus in both committees that nobody wanted to add any more encodings for presentation forms of Arabic ligatures. So, when a need arose to add another range of noncharacters, the UTC simply decided that the otherwise unused range U+FDD0..U+FDEF could serve for that, while not requiring the addition of a new two-column block that could otherwise be used on the BMP. There are several symbol blocks on the BMP which have also had a somewhat colorful and creative history of hole-filling over time. --Ken
Re: What does it mean to not be a valid string in Unicode?
To assess whether a string is invalid, it all depends on what the string is supposed to be. 1. As Ken says, if a string is supposed to be in a given encoding form (UTF), but it consists of an ill-formed sequence of code units for that encoding form, it would be invalid. So an isolated surrogate (eg 0xD800) in UTF-16 or any surrogate (eg 0xD800) in UTF-32 would make the string invalid. For example, a Java String may be an invalid UTF-16 string. See http://www.unicode.org/glossary/#unicode_encoding_form 2. However, a Unicode X-bit string does not have the same restrictions: it may contain sequences that would be ill-formed in the corresponding UTF-X encoding form. So a Java String is always a valid Unicode 16-bit string. See http://www.unicode.org/glossary/#unicode_string 3. Noncharacters are also valid in interchange, depending on the sense of interchange. The TUS says “In effect, noncharacters can be thought of as application-internal private-use code points.” If I couldn't interchange them ever, even internal to my application, or between different modules that compose my application, they'd be pointless. They are, however, strongly discouraged in *public* interchange. The glossary entry and some of the standard text is a bit old here, and needs to be clarified. 4. The quotation “we select a substring that begins with a combining character, this new string will not be a valid string in Unicode” is wrong. It *is* a valid Unicode string. It isn't particularly useful in isolation, but it is valid. For some *specific purpose*, any particular string might be invalid. For example, the string mark#d might be invalid in some systems as a password, where # is disallowed, or where passwords might be required to be 8 characters long. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Fri, Jan 4, 2013 at 3:10 PM, Stephan Stiller stephan.stil...@gmail.com wrote: A Unicode string in UTF-8 encoding form could be ill-formed if the bytes don't follow the specification for UTF-8, for example. Given that answer, add in UTF-32 to my email just now, for simplicity's sake. Or let's simply assume we're dealing with some sort of sequence of abstract integers from hex 0 to hex 10FFFF, to abstract away from encoding form issues. Stephan
Re: locale-aware string comparisons
Agreed. FYI, for those interested, here is the data file I generated with the approaches A, B, C as discussed. https://docs.google.com/a/google.com/spreadsheet/pub?key=0AqRLrRqNEKv-dGk0RHVoQWN6OGw1TVFNOWRaMEJfWEE&gid=0 Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Wed, Jan 2, 2013 at 11:07 AM, Shawn Steele shawn.ste...@microsoft.com wrote: I'd try to avoid making a dependency where case mapping needs to be the same as case insensitive comparisons. I'd either always case fold then compare, or always compare case insensitive. -Shawn -Original Message- From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of James Cloos Sent: Tuesday, January 1, 2013 5:43 PM To: Mark Davis ☕ Cc: Whistler, Ken; unicode@unicode.org Subject: Re: locale-aware string comparisons MD == Mark Davis ☕ m...@macchiato.com writes: MD All of these are different, all of them still have over 200 MD differences from either compare(lower(x),lower(y)) or compare(upper(x),upper(y)) What about, then: compare(lower(x),lower(y)) || compare(upper(x),upper(y)) Or, to emphasize that I mentioned C only as a pseudocode, akin to SQL: LOWER(x) LIKE LOWER(y) OR UPPER(x) LIKE UPPER(y) Would that cover all of the outliers? -JimC -- James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6
Re: locale-aware string comparisons
3. Regarding LDML and CLDR, somebody with specific expertise on CLDR James, Even without locale differences, the situation is a bit tricky. Assuming that str_tolower() and str_toupper() were straightforwardly defined in terms of the (full) Unicode case mappings, there is still the issue that the DUCET does not define a caseless compare. It puts case together with other variants into a set of Level 3 data. There are 3 approaches one can take with a strcasecmp() straightforwardly based on LDML. I generated some numbers for these with a quick test program, but note that they use the CLDR root locale, which has a few changes from DUCET. A. Define it to be just comparing after Unicode case folding. B. Use DUCET and only compare according to Levels 1 & 2. That ignores case, but also some other features. C. Use the case level as defined in LDML, plus Levels 1 & 2. All of these are different, and all of them still have over 200 differences from either compare(lower(x),lower(y)) or compare(upper(x),upper(y)). These are mostly because of the special weighting of compatibility variants, or of the Greek iota subscript. Example: s ≠ ſ, but upper( s ) = upper( ſ ) // LATIN SMALL LETTER S vs LATIN SMALL LETTER LONG S Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Mon, Dec 31, 2012 at 3:29 PM, Whistler, Ken ken.whist...@sap.com wrote: Well, in answering the question which was actually posed here: 1. ISO/IEC 10646 has absolutely nothing to say about this issue, because 10646 does not define case mapping at all. 2. The Unicode Standard *does* define case mapping, of course, as well as case folding. The relevant details are in Section 3.13 of the standard, supported by various data files in the Unicode Character Database. TUS 6.2, Section 3.13, p. 117, does define toUpperCase(X) and toLowerCase(X), but those are string mapping operations, not directly comparable to Linux (and in general Unix) toupper() and tolower(), which are character mapping functions. The closer correlates to Linux toupper() and tolower() are Unicode's definitions of Uppercase_Mapping(C) and Lowercase_Mapping(C). However, there is a significant difference lurking, in that the Unicode case mapping definitions are not locale-sensitive. The full case mappings do include two conditional sets of mappings (from SpecialCasing.txt) for Lithuanian and for Turkish and Azeri, mostly affecting the behavior of the dot on i, but the use of those conditional mappings depends on the availability of explicit language context. This contrasts with the Linux (and in general Unix) toupper() and tolower() functions, which in principle, at least, are locale-sensitive, depending on the current locale setting, and in particular on whether the LC_CTYPE category in the locale has a non-null list of mappings for toupper and/or tolower in it. Perhaps even more importantly, the Unicode Standard does not state anything regarding the details of the behavior of the APIs strcasecmp() or tolower() or toupper() in libc. Those are the concerns of the C and POSIX specs, not the Unicode Standard. Nor could the Unicode Standard really get involved in this, precisely because that behavior involves locales, and locales are outside the scope of the Unicode Standard. 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR may have to jump in here, but while locales clearly *are* in the scope of LDML and CLDR, there is currently little if anything they have to say about specific case mapping rules.
As regards the particulars of the question, I suspect that it would depend in part on how strcasecmp(), str_tolower() and str_toupper() are implemented (I am assuming string conversion APIs here based on the tolower() and toupper() APIs), but there probably *are* instances where the results would diverge. The most likely source of trouble would be Turkish case mapping. In particular, if you compare U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE to a canonically equivalent sequence of U+0049, U+0307, there may be conundrums. If strcasecmp() is implemented based on Turkish case folding, then strcasecmp( U+0130, U+0049, U+0307 ) == 0. If str_tolower() is based on Turkish case mapping, then str_tolower( U+0130 ) == U+0069, U+0307, so strcmp(str_tolower( U+0130 ), str_tolower( U+0049, U+0307 )) == 0, *but* str_toupper( U+0130 ) == U+0130 and str_toupper( U+0049, U+0307 ) == U+0049, U+0307, so strcmp(str_toupper( U+0130 ), str_toupper( U+0049, U+0307 )) != 0. The two uppercased versions are *canonically* equivalent, but you wouldn't expect a strcmp() operation to be checking normalization of strings. So unless the implementations of str_tolower() and str_toupper() were doing canonical normalization as well as case mapping, you could indeed find some odd edge cases for Turkish casing, at least. --Ken Given (just) the data in 10646, Unicode and cldr, are there any locales where a
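Both of the surprises discussed in this thread can be reproduced with the JDK's built-in case mappings; a minimal Java sketch:

    import java.util.Locale;

    // LONG S: distinct from s, yet both uppercase to S (Mark's example).
    System.out.println("\u017F".equals("s"));               // false
    System.out.println("\u017F".toUpperCase(Locale.ROOT));  // S
    // TURKISH I WITH DOT ABOVE: the default (SpecialCasing) lowercase is
    // "i" + COMBINING DOT ABOVE; under the Turkish locale rules it is plain "i".
    System.out.println("\u0130".toLowerCase(Locale.ROOT));       // i + U+0307
    System.out.println("\u0130".toLowerCase(new Locale("tr")));  // i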
Re: Character name translations
There are different use cases, and I think they are getting confused. 1. Present a name for each character, some sort of formal name. I think this is probably the least useful for average users. 2. Allow searching for characters, eg in a character picker. Sample use case: search for dash (or the equivalent in Georgian) and get the dashes. 3. Provide disambiguating information about a character (to distinguish it from visually similar characters). Sample use case: hovering over a character shows em dash vs en dash (or the equivalent in Georgian). Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Thu, Dec 20, 2012 at 8:18 AM, Asmus Freytag asm...@ix.netcom.com wrote: In my other message, I made clear that I think translations of just the names is a lot less useful than translation of the full information presented in the code charts, which includes block (and therefore script) names, annotations and listing of alternate names by which these characters are known to ordinary users.
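Use case 2 is easy to prototype against the character names themselves; a minimal Java sketch using Character.getName (Java 7 and later), English-only since the JDK carries no translated names:

    // Find the BMP characters whose formal name contains "DASH".
    for (int cp = 0; cp <= 0xFFFF; ++cp) {
        String name = Character.getName(cp);
        if (name != null && name.contains("DASH")) {
            System.out.printf("U+%04X %s%n", cp, name);
        }
    }
    // Prints U+2013 EN DASH, U+2014 EM DASH, U+301C WAVE DASH, etc.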
Some much-needed improvements in JavaScript i18n
I have a new Google blog post about the new ECMAScript (JavaScript) internationalization spec. “Until now, it has been very difficult for web application designers to do something as simple as sort names correctly according to the user's language. And it matters: English readers wouldn’t expect Århus to sort below Zürich, but Danish speakers would.” … http://googledevelopers.blogspot.com/2012/12/putting-zurich-before-arhus.html Many people contributed to this multi-year effort! Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —*
Re: Question about normalization tests
0300 *is* blocked, because there is a preceding character (0305) that has the same combining class (230). Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Mon, Dec 10, 2012 at 11:55 AM, Edwin Hoogerbeets ehoogerbe...@gmail.com wrote: Looking at 0300, it is also not blocked from 0061, so check the primary composition for 0061 0300. There is a primary composition for that sequence, 00E0, so replace the starter with that, delete the 0300, and continue. The string looks like this now:
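The blocking rule is easy to observe with java.text.Normalizer; a minimal sketch:

    import java.text.Normalizer;

    // a + COMBINING OVERLINE (0305, ccc=230) + COMBINING GRAVE (0300, ccc=230)
    String blocked = "a\u0305\u0300";
    // 0300 is blocked by 0305 (same combining class), so NFC cannot compose a+0300:
    System.out.println(Normalizer.normalize(blocked, Normalizer.Form.NFC).equals(blocked)); // true
    // Without the intervening mark, a + 0300 composes to U+00E0:
    System.out.println(Normalizer.normalize("a\u0300", Normalizer.Form.NFC)); // à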
Re: io9 describes Unicode as one of the 10 most unlikely things influenced by J.R.R. Tolkien
Their inference, it appears, is that had I not read Tolkien when I was 13 I would not be who I am today and the content of the Universal Character Set might be a lot different than it is. I doubt it. Many people are far more responsible for the structure, model, properties, and characters of Unicode, including not only those who belong to the Unicode consortium, but also those in the IRG, those in ISO, and those who originally developed the international, national, and vendor encoding standards that Unicode built upon. Unicode characters, measured by frequency of usage on the web, would be essentially the same had Michael not been around. That would not be the case without people like Ken Whistler, Joe Becker, Lee Collins, Lisa Moore, Michel Suignard, or Asmus Freytag: I could go on, but there are far too many to name. Nor would Unicode have been a success without the many people who worked in different companies to build the infrastructure necessary for its use, or the staff behind the scenes working in the Unicode Consortium. Michael has made many valuable contributions to Unicode, especially for minority and historic scripts. And he can be rightfully proud of the work he has done there. But neither should that work be exaggerated. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Sat, Dec 8, 2012 at 2:56 AM, Michael Everson ever...@evertype.com wrote: On 8 Dec 2012, at 10:07, Shriramana Sharma samj...@gmail.com wrote: Well nice to hear, and of course you have contributed a lot to Unicode! But I fail to see the logical connection between Unicode as a technical standard and Tolkien! I hadn't heard about this website, but if they purport to write on science, but make such illogical deductions, I am not sure I'll be reading it much in future. Their inference, it appears, is that had I not read Tolkien when I was 13 I would not be who I am today and the content of the Universal Character Set might be a lot different than it is. Michael Everson * http://www.evertype.com/
Re: StandardizedVariants.txt error?
I agree with that analysis. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Mon, Nov 26, 2012 at 1:53 PM, Whistler, Ken ken.whist...@sap.com wrote: Actually, I think the omission here is the word canonical. In other words, Section 16.4 should probably read: The base character in a variation sequence is never a combining character or a *canonical* decomposable character. Note that with this addition, StandardizedVariants.txt poses no contradiction, because all of the decomposable character instances noted are compatibility decomposable characters. The main concern here with this restriction is to ensure that one doesn't end up with conundrums involving canonical decompositions into sequences followed by a variation selector. In the case of compatibility decompositions, there already is no expectation that the appearance and the interpretation of the text will be preserved. With a decomposition mapping like <font> 0069, the decomposition is already indicating a typically different appearance. If you decompose U+2139 to U+0069, you have already lost information about appearance and interpretation. So it isn't that much of a stretch to assume that any relevant variation sequences will also lose their interpretation. But I think it might make sense, in addition to the above textual fix, to add a note to the standard to indicate that variation sequences preserve their validity across *canonical* normalization forms, but that there is no guarantee that variation sequences will remain valid for any compatibility normalization. --Ken 2012-11-24 8:12, Masatoshi Kimura wrote: According to TUS v6.2 clause 16.4, http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf#page=15 The base character in a variation sequence is never a combining character or a decomposable character. However, the following base characters appearing in http://unicode.org/Public/6.2.0/ucd/StandardizedVariants.txt have a decomposition mapping. There seems to be a contradiction here. “Decomposable character” is defined in clause 3.7 as follows: “A character that is equivalent to a sequence of one or more other characters, according to the decomposition mappings found in the Unicode Character Database, and those described in Section 3.12, Conjoining Jamo Behavior.” I suppose the intended meaning in clause 16.4, given its context, is to say that the base character is neither a combining character nor a character with a decomposition that contains a combining character. Yucca
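The U+2139 example is easy to check: its decomposition is compatibility-only, so canonical normalization leaves it alone while NFKC folds it to a plain letter. A minimal Java sketch:

    import java.text.Normalizer;

    // U+2139 INFORMATION SOURCE has the compatibility decomposition <font> 0069.
    System.out.println(Normalizer.normalize("\u2139", Normalizer.Form.NFC));  // ℹ (unchanged)
    System.out.println(Normalizer.normalize("\u2139", Normalizer.Form.NFKC)); // i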
Re: Caret
This case remains very infrequent: it is extremely rare to start typing text in
With arrow keys or mouse clicking it is more frequent to end up on a directional boundary. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Mon, Nov 12, 2012 at 1:47 PM, Asmus Freytag asm...@ix.netcom.com wrote: On 11/12/2012 1:27 PM, Khaled Hosny wrote: I’m not sure from where you are getting your statistics, but I’ve to deal with all those “rare” and “extremely rare” situations all the day. Khaled, don't mind Philippe - his experience is a bit on the theoretical end. A./
Re: Character set cluelessness
I tend to agree. What would be useful is to have one column for the city in the local language (or more columns for multilingual cities), but it is extremely useful to have an ASCII version as well. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Tue, Oct 2, 2012 at 1:23 PM, Jonathan Rosenne jonathan.rose...@gmail.com wrote: I don't agree with the criticism. These place names are there to be readable by a wide audience, rather than writable by locals and specialists. They require the lowest common denominator. Jony From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf Of john knightley Sent: Tuesday, October 02, 2012 6:35 PM To: Doug Ewell Cc: unicode@unicode.org; loc...@unece.org Subject: Re: Character set cluelessness Sad to say, this seems to be close to the norm for all too many large organizations, where if it isn't in the 1990's version of the Times Roman font then it's out. John On 3 Oct 2012 00:26, Doug Ewell d...@ewellic.org wrote: The United Nations Economic Commission for Europe (UNECE) has released a new version of UN/LOCODE, and their Secretariat Note document is just as clueless as ever about character set usage in international standards: Place names in UN/LOCODE are given in their national language versions as expressed in the Roman alphabet using the 26 characters of the character set adopted for international trade data interchange, with diacritic signs, when practicable (cf. Paragraph 3.2.2 [sic; should be 3.3.2] of the UN/LOCODE Manual). International ISO Standard character sets are laid down in ISO 8859-1 (1987) and ISO 10646-1 (1993). (The standard United States character set (437), which conforms to these ISO standards, is also widely used in trade data interchange). It's 2012. How does one get through to folks like this? I tried writing to them a few years ago, but I don't think they were impressed by an individual contribution. http://www.unece.org/cefact/locode/welcome.html -- Doug Ewell | Thornton, Colorado, USA http://www.ewellic.org | @DougEwell
Re: Character set cluelessness
Eg, in http://www.unece.org/fileadmin/DAM/cefact/locode/gr.htm Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Tue, Oct 2, 2012 at 1:49 PM, Mark Davis ☕ m...@macchiato.com wrote: I tend to agree. What would be useful is to have one column for the city in the local language (or more columns for multilingual cities), but it is extremely useful to have an ASCII version as well. [...]
Re: Character set cluelessness
And just to be clear, I do agree that their documentation of the standards usage, well, needs improvement. I'm just talking about the actual data, and for that, as a practical matter, it is valuable to have both the native language version(s) of a name and a Latin equivalent. Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Tue, Oct 2, 2012 at 2:52 PM, Mark Davis ☕ m...@macchiato.com wrote: Eg, in http://www.unece.org/fileadmin/DAM/cefact/locode/gr.htm [...]
Re: Announcing The Unicode Standard, Version 6.2
BTW, if you want to share the announcement: - Google+: https://plus.sandbox.google.com/u/0/109412260435993059737/posts (I also reposted with my personal account: https://plus.google.com/114199149796022210033.) - Facebook: http://www.facebook.com/pages/Friends-of-Unicode/127785250588285 - Twitter: http://twitter.com/unicode/ Mark On Wed, Sep 26, 2012 at 1:06 PM, announceme...@unicode.org wrote: Version 6.2 of the Unicode Standard is now available. This version adds only a single character, the newly adopted Turkish Lira sign; however, the properties and behaviors for many other characters have been adjusted. Emoji and pictographic symbols now have significantly improved line-breaking, word-breaking and grapheme cluster behaviors. The script categorizations for some characters are improved and better documented. The Unicode Collation Algorithm has been greatly enhanced for Version 6.2, with a major overhaul of its documentation. There have also been significant changes to the collation weight tables, including improved handling of tertiary weights for characters with decompositions, and changed weights for some pictographic symbols. The newly encoded Turkish Lira sign, like other currency symbols, is expected to be heavily used in its target environment. The Unicode Consortium accelerated the release of Unicode 6.2 to accommodate the urgent need for this character. For more details of this release, see http://www.unicode.org/versions/Unicode6.2.0/.