Re: Another take on the English Apostrophe in Unicode

2015-06-16 Thread Mark Davis ☕️
And, Marcel, while you are at it, this is getting tiresome.

Please find some other place to vent about events you know very little
about; the internet is full of them.

Mark


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Tue, Jun 16, 2015 at 7:33 PM, Doug Ewell d...@ewellic.org wrote:

 Marcel Schneider charupdate at orange dot fr wrote:

  That's to despise people, that's to spit at their face.

 You know what? If you want to use U+02BC as an English apostrophe, go
 ahead and use it. Nobody's stopping you really. Not Unicode, not
 Microsoft, not ISO.

 I do wish we could put an end to all the accusations of malfeasance.

 --
 Doug Ewell | http://ewellic.org | Thornton, CO 





Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Mark Davis ☕️
On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider charupd...@orange.fr
wrote:

 When we take the topic down again from linguistics to the core mission of
 Unicode, that is, character encoding and text processing standardisation,
 ellipsis and the Swedish abbreviation colon differ from the single closing
 quotation mark in that they are not to be processed.



 Linguistics, however, delivered the foundation on which Unicode issued its
 first recommendation on what character to use for apostrophe. The result
 was neither a matter of opinion, nor of probabilities.



 Actually, the choice is between perpetuating confusion in word processing,
 and getting people confused for a little while by announcing that U+2019 for
 apostrophe was a mistake.


​Quite nice of you to inform me of the core mission of Unicode—I must have
somehow missed that.


More seriously, it is not all so black and white. As we developed Unicode,
we considered whether to separate characters by function, e.g., an END OF
SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING
PERIOD, etc., or DIAERESIS vs UMLAUT. We quickly concluded that the costs
far, far outweighed the benefits.

In practice, whenever characters are essentially identical—and by that I
mean that the overlap between the acceptable glyphs for each character is
very high—people will inevitably mix up the characters on entry. So any
processing that depends on that distinction is forced to correct the data
anyway. And separating them causes even simple things like searching for a
character on a page to get screwed up without having equivalence classes.

So we only separated essentially identical characters in limited cases:
such as letters from different scripts.

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


Re: Another take on the English apostrophe in Unicode

2015-06-13 Thread Mark Davis ☕️
On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable peter...@microsoft.com
wrote:

 When it comes to orthography, the notion of what comprises the words of a
 language is generally pure convention. That’s because there isn’t any
 single *linguistic* definition of word that gives the same answer when
 phonological vs. morphological or syntactic criteria are applied. There are
 book-length works on just this topic, such as this:


​In particular, I see no need to change our recommendation on the character
used in contractions for English and many other languages (U+2019).
Similarly, we wouldn't recommend use of anything but the colon for marking
abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for
​supercali...docious.

(IMO, U+02BC was probably just a mistake; the minor benefit is not worth
the confusion.)

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


Re: free download of ISO/IEC 10646 (was: Accessing the WG2 document register)

2015-06-11 Thread Mark Davis ☕️
​I think the whole thread got overheated, and Andrew was just responding to
other heated ​comments. So it might be time to let this thread cool off a
bit.

The collaboration over the years between the Unicode Consortium and ISO has
been, on the whole, a remarkable success. There have been frictions—as in
any human enterprise—but the parties have worked to smooth those over, and
to operate in good faith to incorporate the characters that are important
to each side. The rising bureaucracy on the ISO side has made progress and
collaboration increasingly difficult, but that did not originate with the
SC2 or WG2 participants, who are often just as frustrated by it.


Re: http://✈.ws

2015-06-05 Thread Mark Davis ☕️
Whoops, sent too soon.

A surprise: http://✈.ws


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Fri, Jun 5, 2015 at 4:47 PM, Mark Davis ☕️ m...@macchiato.com wrote:





http://✈.ws

2015-06-05 Thread Mark Davis ☕️



Re: The Oral History Of The Poop Emoji

2015-06-01 Thread Mark Davis ☕️
One of many on http://unicode.org/press/emoji.html


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Mon, Jun 1, 2015 at 8:23 PM, Karl Williamson pub...@khwilliamson.com
wrote:


 https://www.fastcompany.com/3037803/the-oral-history-of-the-poop-emoji-or-how-google-brought-poop-to-america



Re: FYI: The world’s languages, in 7 maps and charts

2015-05-27 Thread Mark Davis ☕️
Hmmm. How accurate can it be? They forgot Austria, and got Switzerland
wrong by almost a power of 10.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye moy...@gmail.com wrote:

 The South China Morning Post published a similar infographic:
 A world of languages - and how many speak them

 http://www.scmp.com/infographics/article/1810040/infographic-world-languages



Re: FYI: The world's languages, in 7 maps and charts

2015-05-27 Thread Mark Davis ☕️
I think it gives a misleading picture to only include mother-language
speakers, rather than all languages (at a reasonable level of fluency).
Every Swiss German is fluent in High German.

Part of the problem is that it is very hard to get good data on the
multiple languages that people speak—a huge number of people are fluent in
more than one—and on the level of fluency in each. That alone makes it
difficult to do accurate representations. That level of accuracy may not be
necessary to get a general picture, but when the map purports to go into
great detail...


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Wed, May 27, 2015 at 4:59 PM, Denis Jacquerye moy...@gmail.com wrote:

 The data used to build the infographic comes from Ethnologue.com.
 http://www.ethnologue.com/language/deu does not indicate the Standard
 German L1 population in Austria and gives a population of 727 000 Standard
 German L1 speakers in Switzerland (the difference is counted as Swiss
 German L1 speakers).

 On Wed, 27 May 2015 at 11:22 Mark Davis ☕️ m...@macchiato.com
 wrote:

 Hmmm. How accurate can it be? They forgot Austria, and got Switzerland
 wrong by almost a power of 10.


 Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*

 On Wed, May 27, 2015 at 10:18 AM, Denis Jacquerye moy...@gmail.com
 wrote:

 The South China Morning Post published a similar infographic:
 A world of languages - and how many speak them

 http://www.scmp.com/infographics/article/1810040/infographic-world-languages





Re: Tag characters

2015-05-18 Thread Mark Davis ☕️
​A few notes.

A more concrete proposal will be in a PRI to be issued soon, and people
will have a chance to comment more then. (I'm not trying to discourage
discussion, just pointing out that there will be something more concrete
relatively soon to comment on—people are pretty busy getting 8.0 out the
door right now.)

The principal reason for 3-digit codes is that this is the mechanism
used by BCP47 in case ISO screws up codes (as it did for CS).

The syntax does not need to follow the 3166 syntax; the codes correspond
but are not the same anyway. So we didn't see the syntactic necessity for
the hyphen.

There is a difference between EU and UN; the former is in BCP47. That being
said, we could look at making the exceptionally reserved codes valid for
this purpose (or at least the UN code). It appears that there are only 3
exceptionally reserved codes that aren't in BCP47: EZ, UK, UN.

Just because a code is valid doesn't mean that there is a flag associated
with it. Just like the fact that you can have the BCP47 code ja-Ahom-AQ
doesn't mean that it denotes anything useful. I'd expect vendors to not
waste time with non-existent flags. However, we could also discuss having a
mechanism in CLDR to help provide guidelines as to which subdivisions are
suitable as flags.
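
To make the mechanism concrete, here is a minimal sketch of how a code
could be spelled out with tag characters (which shadow printable ASCII at
U+E0020..U+E007E). The base character and the overall shape are assumptions
pending the PRI, not a final design; the class name is illustrative only.

public class TagSequenceSketch {
    static final int TAG_OFFSET = 0xE0000; // tag chars shadow ASCII 0x20..0x7E
    static final int CANCEL_TAG = 0xE007F; // terminates the tag sequence
    static final int BASE_FLAG = 0x1F3F4;  // assumed base character

    static String flagFor(String code) {
        StringBuilder sb = new StringBuilder();
        sb.appendCodePoint(BASE_FLAG);
        for (char c : code.toCharArray()) {
            sb.appendCodePoint(TAG_OFFSET + c); // 'g' -> U+E0067, '1' -> U+E0031
        }
        sb.appendCodePoint(CANCEL_TAG);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(flagFor("gbsct")); // a subdivision code (Scotland)
        System.out.println(flagFor("150"));   // a UN M.49 code (Europe)
    }
}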





Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Sat, May 16, 2015 at 10:07 AM, Doug Ewell d...@ewellic.org wrote:

 L2/15-145R says:

  On some platforms that support a number of emoji flags, there is
 substantial demand to support additional flags for the following:
 [...]
 Certain supra-national regions, such as Europe (European Union flag)
 or the world (e.g. United Nations flag). These can be represented
 using UN M49 3-digit codes, for example 150 for Europe or 001 for
 World.


 These are uncomfortable equivalence classes. Not all countries in Europe
 are members of the European Union, and the concept of "United Nations" is
 not really the same by definition as "all countries in the world".

 The remaining UN M.49 code elements that don't have a 3166-1 equivalent
 seem wholly unsuited for this mechanism (and those that do, don't need it).
 There are no flags for "Middle Africa" or "Latin America and the Caribbean"
 or "Landlocked developing countries".

 Some trans-national organizations might _almost_ seem as if they could be
 shoehorned into an M.49 code element, like identifying 035 "South-Eastern
 Asia" with the ASEAN flag, but this would be problematic for the same
 reasons as 150 and 001.

 Among the ISO 3166-1 exceptionally reserved code elements are EU for
 European Union and UN for United Nations. If these flags are the use
 cases, why not simply use those alpha-2 code elements, instead of burdening
 the new mechanism with the 3-digit syntax?


 --
 Doug Ewell | http://ewellic.org | Thornton, CO 



Re: Tag characters

2015-05-15 Thread Mark Davis ☕️
The consortium is in no position to enhance protocols *itself* for
exchanging images. That's firmly in other groups' hands. We can try to
noodge them a bit, but what *will* make a difference is when the *vendors*
of sticker solutions put pressure on the different groups responsible for
the protocols to provide interoperability for images. Because there is a
lot of growth in sticker solutions, I would expect there to be more such
pressure. And even so, I expect those will take some time to be deployed.

We've said what our longer-term position is, and I think we all pretty much
agree with that; exchanging images is much more flexible. However, we do
have strong short-term pressure to show that we are responsive and
responsible in adding emoji. And our adding a reasonable number of emoji
per year is not going to stop Line or Skype from adding stickers!

There are a few possible scenarios, and it's hard to predict the results.
It could be that emoji are largely supplanted by stickers in 5 years; could
be 10; could be that they both coexist indefinitely. I have no crystal ball, and
neither does anyone else...


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Thu, May 14, 2015 at 7:44 PM, Peter Constable peter...@microsoft.com
wrote:

  And yet UTC devotes lots of effort (with an entire subcommittee) to
 encode more emoji as characters, but no effort toward any preferred longer
 term solution not based on characters.





 Peter



 *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Shervin
 Afshar
 *Sent:* Thursday, May 14, 2015 2:27 PM
 *To:* wjgo_10...@btinternet.com
 *Cc:* unicode@unicode.org
 *Subject:* Re: Tag characters



 Thinking about this further, could the technique be used to solve the
 requirements of
 section 8 Longer Term Solutions



 IMO, the industry-preferred longer-term solution (which is also discussed
 in that section with a few existing examples) for emoji is not going to be
 based on characters.




   ↪ Shervin



 On Thu, May 14, 2015 at 1:40 PM, William_J_G Overington 
 wjgo_10...@btinternet.com wrote:

  What else would be possible if the same sort of technique were applied
 to another base character?


 Thinking about this further, could the technique be used to solve the
 requirements of

 section 8 Longer Term Solutions

 of

 http://www.unicode.org/reports/tr51/tr51-2.html

 ?


 Both colour pixel map and colour OpenType vector font solutions would be
 possible.


 Colour voxel map and colour vector 3d solids solutions are worth thinking
 about too as fun coding thought experiments that could possibly lead to
 useful practical results.




 William Overington


 14 May 2015





FYI: The world’s languages, in 7 maps and charts

2015-05-12 Thread Mark Davis ☕️
http://www.washingtonpost.com/blogs/worldviews/wp/2015/04/23/the-worlds-languages-in-7-maps-and-charts/


Re: Script / font support in Windows 10

2015-05-08 Thread Mark Davis ☕️
Thanks!


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Fri, May 8, 2015 at 7:15 AM, Peter Constable peter...@microsoft.com
wrote:

  I think this is the right public link:



 https://msdn.microsoft.com/en-us/goglobal/bb688099.aspx





 *From:* Peter Constable
 *Sent:* Thursday, May 7, 2015 10:29 PM
 *To:* Peter Constable; unicode@unicode.org
 *Subject:* RE: Script / font support in Windows 10



 Oops… my bad: maybe it isn’t on live servers yet. It will be soon. I’ll
 update with the public link when it is.



 *From:* Unicode [mailto:unicode-boun...@unicode.org
 unicode-boun...@unicode.org] *On Behalf Of *Peter Constable
 *Sent:* Thursday, May 7, 2015 10:15 PM
 *To:* unicode@unicode.org
 *Subject:* Script / font support in Windows 10



 This page on MSDN that provides an overview of Windows support for
 different scripts has now been updated for Windows 10:



 https://msdnlive.redmond.corp.microsoft.com/en-us/bb688099







 Peter



Re: Ways to detect that XXXX in JSON \uXXXX does not correspond to a Unicode character?

2015-05-07 Thread Mark Davis ☕️
The simplest approach would be to use ICU in a little program that scans
the file. For example, you could write a little Java program that would
scan the file, turn any sequence of (\uXXXX)+ into a String, and then
test that string with:

static final UnicodeSet OK = new
UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();
...
// inside the scanning function
boolean isOk = OK.containsAll(slashUString);

It is key that it has to grab the entire sequence of \uXXXX escapes in a row;
otherwise it will get the wrong answer.
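
Here is a slightly fuller sketch along those lines, assuming ICU4J on the
classpath; the class name, the regex, and the way the file is read are
illustrative only.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.ibm.icu.text.UnicodeSet;

public class JsonEscapeScanner {
    static final UnicodeSet OK =
        new UnicodeSet("[^[:unassigned:][:surrogate:]]").freeze();
    // Matches each maximal run of \uXXXX escapes, so that surrogate
    // pairs are converted together rather than one escape at a time.
    static final Pattern RUN = Pattern.compile("(?:\\\\u[0-9A-Fa-f]{4})+");

    public static void main(String[] args) throws Exception {
        String json = new String(
            Files.readAllBytes(Paths.get(args[0])), "UTF-8");
        Matcher m = RUN.matcher(json);
        while (m.find()) {
            String run = m.group();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < run.length(); i += 6) { // each escape is 6 chars
                sb.append((char) Integer.parseInt(run.substring(i + 2, i + 6), 16));
            }
            if (!OK.containsAll(sb.toString())) {
                System.out.println("suspect escape run: " + run);
            }
        }
    }
}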


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Thu, May 7, 2015 at 10:49 AM, Doug Ewell d...@ewellic.org wrote:

 Costello, Roger L. Costello at mitre dot org wrote:

  Are there tools to scan a JSON document to detect the presence of
  \uXXXX, where XXXX does not correspond to any Unicode character?

 A tool like this would need to scan the Unicode Character Database, for
 some given version, to determine which code points have been allocated
 to a coded character in that version and which have not.

 --
 Doug Ewell | http://ewellic.org | Thornton, CO 




Combining character example

2015-04-16 Thread Mark Davis ☕️
I happened to run across a good example of productive use of combining
marks, the Duden site (a great online dictionary for German). They use
U+0323 (   ̣) COMBINING DOT BELOW to indicate the stress. Here is an
example:

ụnterbuttern

http://www.duden.de/rechtschreibung/unterbuttern

They aren't, however, consistent; you also see underlining for stress.

e̲i̲nschränken
Interestingly, this is done not with HTML underlining, but with U+0332 (  ̲  )
COMBINING LOW LINE.

Mark
https://google.com/+MarkDavis


Re: Combining character example

2015-04-16 Thread Mark Davis ☕️
Thanks for the corrections; I should have looked for a key to the
conventions they use.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Thu, Apr 16, 2015 at 11:32 AM, Jörg Knappen jknap...@web.de wrote:

 Hi Mark,

 the use of DOT BELOW and LINE BELOW is in fact consistent in German Duden.
 The
 difference in the diacritics is used to denote length of the stressed
 vowel, DOT BELOW
 denotes a short vowel and LINE BELOW denotes a long vowel.

 Diphthongs are always long and there is a single line under the whole
 Diphthong.

 Digraphs (e.g. the ou in words borrowed from French) also have either a
 single line
 under the whole digraph or (this happens rarely) a single dot in the
 middle of the
 digraph.

  --Jörg Knappen

  *Gesendet:* Donnerstag, 16. April 2015 um 10:01 Uhr
 *Von:* Mark Davis ☕️ m...@macchiato.com
 *An:* Unicode Public unicode@unicode.org, Unicode Book 
 b...@unicode.org
 *Betreff:* Combining character example
   I happened to run across a good example of productive use of combining
 marks, the Duden site (a great online dictionary for German). They use
 U+0323 (   ̣) COMBINING DOT BELOW to indicate the stress. Here is an
 example:

 ụnterbuttern

   http://www.duden.de/rechtschreibung/unterbuttern

 They aren't, however, consistent; you also see underlining for stress.

  e̲i̲nschränken
 But not, interestingly, with the HTML underline, but with U+0332 (  ̲  )
 COMBINING LOW LINE.

Mark https://google.com/+MarkDavis




Re: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!

2015-03-26 Thread Mark Davis ☕️
It only provides a stand-in glyph if you don't otherwise have a font for
that character on your system. That stand-in just indicates the type of
character (e.g., script).

No single font with current technology can handle all of Unicode. The most
complete open font set I know of is the Noto family:
https://www.google.com/get/noto/. I don't think it has a full set of
symbols (others: correct me if I'm wrong.) Symbola is pretty good for
arbitrary symbols.

There are many other resources on http://unicode.org/resources/fonts.html.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Thu, Mar 26, 2015 at 8:53 AM, Michael McGlothlin 
mike.mcgloth...@gmail.com wrote:

 Similar but with a couple differences. Most important would be getting
 vendors to actually use the font. Also it should be appropriate to actually
 display the characters rather than being debugging information.

 Does this last resort font represent every character in some meaningful
 way? e.g. I've tried to use somewhat rare characters like runes before and
 it was a pretty big pain to find fonts that were free to distribute,
 weren't buggy, and displayed the correct symbol for that character. And
 some applications wouldn't display them correctly even after installing a
 font. (Visual Studio let me use runes as variable names and compiled fine
 but wouldn't actually display the rune symbols.)


 Sent from my iPad

 On Mar 25, 2015, at 5:18 PM, Shervin Afshar shervinafs...@gmail.com
 wrote:

 Just like Unicode Last Resort Font[1]?

  [1]: http://www.unicode.org/policies/lastresortfont_eula.html

 ↪ Shervin

 On Wed, Mar 25, 2015 at 2:24 PM, Michael McGlothlin 
 mike.mcgloth...@gmail.com wrote:

 I'd like to see a free/open default font that has a correct, simple
 styled, symbol for every Unicode character. Vendors should be pressured to
 use this font when other options aren't available. I get tired of seeing
 default symbols, incorrect symbols, and mystery white spaces that aren't
 really white space. It's pretty silly to have a code point without a
 default symbol I think.


 Thanks,
 Michael McGlothlin
 Sent from my iPhone

 On Mar 25, 2015, at 12:20 PM, Robert Wheelock rwhlk...@gmail.com wrote:

 Hello!

 When you’re typing, do you find yourself winding up being CONFUSED over
 what you type?!?!  It’s a crucially SERIOUS matter—especially when typing
 on a computer!

 For instance:  When you type in a HOLLOW HEART SUIT (U+02661), it may
 show up as an IDENTICAL TO SIGN (U+02261) or a GREEK CAPITAL LETTER XI
 (U+0039E)... it all DEPENDS on whatever FONT you’re using to type with!

 The default Microsoft Sans Serif font (within Microsoft Windows) has this
 ABOMINABLE habit of substituting this IDENTICAL TO SIGN (which should be at
 U+02261)—because Microsoft (regrettably) placed this math symbol where the
 HOLLOW HEART SUIT should be (at U+02661)! * ¡AGONISTES!*

 What Microsoft SHOULD DO *is* *THIS*:  Please move the IDENTICAL TO SIGN
 from (U+02661—the location where the HOLLOW HEART SUIT goes) to its PROPER
 LOCATION at (U+02261)!!  THAT would be MUCH better!!

 What other CHARACTER CALAMITIES have you come across?!?!

 Thank You!


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode



 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Android 5.1 ships with support for several minority scripts

2015-03-14 Thread Mark Davis ☕️
Congrats!

{phone}
On Mar 14, 2015 03:09, Roozbeh Pournader rooz...@unicode.org wrote:

 Android 5.1
 http://officialandroid.blogspot.com/2015/03/android-51-unwrapping-new-lollipop.html,
 released earlier this week, has added support for 25 minority scripts. The
 wide coverage can be reproduced by almost everybody for free, thanks to the
 Noto https://code.google.com/p/noto/ and HarfBuzz
 http://www.freedesktop.org/wiki/Software/HarfBuzz/ projects, both of
 which are open source. (Android itself is open source too.)

 By my count, these are the new scripts added in Android 5.1: Balinese,
 Batak, Buginese, Buhid, Cham, Coptic, Glagolitic, Hanunnoo, Javanese, Kayah
 Li, Lepcha, Limbu, Meetei Mayek, Ol Chiki, Oriya, Rejang, Saurashtra,
 Sundanese, Syloti Nagri, Tagbanwa, Tai Le, Tai Tham, Tai Viet, Thaana, and
 Tifinagh.

 (Android 5.0, released last year, had already added the Georgian lari,
 complete Unicode 7.0 coverage for Latin, Greek, and Cyrillic, and seven new
 scripts: Braille, Canadian Aboriginal Syllabics, Cherokee, Gujarati,
 Gurmukhi, Sinhala, and Yi.)

 Note that different Android vendors and carriers may choose to ship more
 fonts or less, but Android One http://www.android.com/one/ phones and
 most Nexus http://www.google.com/nexus/ devices will support all the
 above scripts out of the box.

 None of this would have been possible without the efforts of Unicode
 volunteers who worked hard to encode the scripts in Unicode. Thanks to the
 efforts of Unicode, Noto, and HarfBuzz, thousands of communities around the
 world can now read and write their language on smartphones and
 tablets for the first time.


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)

2015-02-10 Thread Mark Davis ☕️
We are being pretty conservative about what we add. There are approximately
1,200 emoji characters now (see tr51), and we're anticipating adding
perhaps 50 per release. And we are encouraging a sticker approach for the
longer term.

On the other hand, I wouldn't be surprised if the 41 emoji characters that
we are planning on for Unicode 8.0 end up having a higher frequency of use
than the other 7K characters in the release.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Mon, Feb 9, 2015 at 9:36 PM, Michael Everson ever...@evertype.com
wrote:

 I like symbols a lot. But I know that I and a number of people have been
 thinking that too much emphasis is being put on emoji.

 Michael Everson * http://www.evertype.com/


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Emoji (was: Re: Unicode block for programming related symbols and codepoints?)

2015-02-10 Thread Mark Davis ☕️
 In what character encoding standard, or extension, does ROBOT FACE appear?

Unicode has never been limited to what is in other character encoding
standard or extensions, official or de facto.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Mon, Feb 9, 2015 at 9:16 PM, Doug Ewell d...@ewellic.org wrote:

 Shervin Afshar shervinafshar at gmail dot com wrote:

  There is no longer any requirement that the robot faces and
  burritos appear first in any sort of industry character set
  extension, with which Unicode is then obliged to maintain
  compatibility.
 
  Only if you don't consider existing usage and popular requests as
  requirement and precedence; for example Gmail had Robot Face for a
  long time.

 I said there was no longer a requirement *that the items appear first in
 an industry character set extension*, right?

 In what character encoding standard, or extension, does ROBOT FACE
 appear? "Gmail has it" is not a character encoding standard. Neither is
 "People want to see it."

 "Most popularly requested," as a criterion for adding a character, is
 absolutely new to Unicode. Earlier I wrote privately to a Unicode
 officer about whether PERSON TAKING SELFIE and GIRL TWERKING and PERSON
 DUMPING ICE BUCKET OVER HEAD would be ephemeral enough, and got no
 reply. (What, you've forgotten the ice-bucket craze already? That's
 exactly why "most popular at the moment" wasn't supposed to be a
 criterion.)

 --
 Doug Ewell | Thornton, CO, USA | http://ewellic.org


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: About cultural/languages communities flags

2015-02-09 Thread Mark Davis ☕️
On Tue, Feb 10, 2015 at 12:11 AM, Ken Whistler kenwhist...@att.net wrote:

 for the full context, and for the current 26x26 letter matrix which is
 the basis for the flag glyph implementations of regional indicator
 code pairs on smartphones.

 SC, SO, ST are already taken, but might I suggest putting in for
 registering
 AB for Alba? That one is currently unassigned.

 Yeah, yeah, what is the likelihood of BSI pushing for a Scots two-letter
 code?! But seriously, if folks are planning ahead for Scots independence
 or even some kind of greater autonomy, this is an issue that needs to
 be worked, anyway.

 In the meantime, let me reiterate that there is *no* formal relationship
 between TLD's and the regional indicator codes in Unicode (or the
 implementations
 built upon them). Well, yes, a bunch of registered TLD's do match the
 country
 codes, but there is no two-letter constraint on TLD's. This should already
 be apparent, as Scotland has registered .scot. At this point there isn't
 even
 a limitation of TLD's to ASCII letters, so there is no way to map them
 to the limited set of regional indicator codes in the Unicode Standard.

 Not having a two letter country code for Scotland that matches the
 four letter TLD for Scotland might indeed be a problem for someone,
 but I don't see *this* as a problem that the Unicode Standard needs
 to solve.


I want to add that there are already a fair number of ISO 2-letter
codes for regions that are administered as part of another country, like
Hong Kong. There are also codes for crown possessions like Guernsey. So
having a code for Scotland (and Wales, and N. Ireland) would not really break
precedent. But as Ken says, the best mechanism is for the UK to push for a
code in ISO and the UN.

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: UAX 29 questions

2015-01-30 Thread Mark Davis ☕️
I apologize in advance that I'm running low on time, and didn't go through
all the messages on this thread carefully. So I may not be fully
appreciating people's positions. I'm just making some quick points about 2
items that caught my eye.


1. There are certainly times where two rules in sequence may overlap, just
for simplicity.

X Y* × Z
Y × Z* W

The first rule could trigger on X Y Z W, even though the second would also
trigger on it. This may or may not be sloppiness; sometimes it simply
makes the second rule too convoluted to also exclude triggering on
everything that could possibly trigger earlier.

That being said, if there are simplifications in the rules that would make it
clearer, I'd suggest submitting a proposal for that. The UTC is meeting
next week, and could consider it either then or at subsequent meetings.

Note: the HTML files in http://unicode.org/Public/UNIDATA/auxiliary/ have a
number of sample cases (which are also used in the test files). Hovering
over boundaries in those sample cases shows which rule is triggered, such
as in
http://unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakTest.html#samples

We're always open to additional samples that are illustrative of how the
rules work. As I thought about your message, it became clear to me that it
would be useful to have a complete enough set of sample cases that each
rule is triggered by at least one case, if you or anyone else is interested
in helping to add those.


2. Also, the following 2 rules are not equivalent:

a) Any  × (Format | Extend)
b) X (Extend | Format)* → X

(b) implies (a), but not the reverse. The difference is on the right side
of the characters. Rule (b) affects every subsequent rule, and can be viewed
as a shorthand. After it, we can just say:

A B × C D

And that has the effect of saying:

A (Extend | Format)* B (Extend | Format)* × C (Extend | Format)* D

See also http://unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules

However, it may not be clear that (b) implies (a); that might be what you
are getting at. If so, then we could add an explicit statement to that
effect.
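
To make the shorthand concrete, here is a small example, assuming ICU4J
(whose word BreakIterator implements these rules); the expected segments are
shown in the comment.

import java.util.Locale;
import com.ibm.icu.text.BreakIterator;

public class WordBreakDemo {
    public static void main(String[] args) {
        // U+0323 COMBINING DOT BELOW is an Extend character, so WB4
        // absorbs it into the preceding letter: "ab\u0323cd" is one word.
        String s = "ab\u0323cd ef";
        BreakIterator bi = BreakIterator.getWordInstance(Locale.ROOT);
        bi.setText(s);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
                start = end, end = bi.next()) {
            System.out.println("[" + s.substring(start, end) + "]");
        }
        // Prints [aḅcd], [ ], [ef]: one word, not two.
    }
}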



Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Thu, Jan 29, 2015 at 7:52 PM, Karl Williamson pub...@khwilliamson.com
wrote:

 On 01/25/2015 05:14 AM, Philippe Verdy wrote:

 This is not a contradiction.


 At the very least it is too sloppy for a standard.  Once there is a match
 in the list of rules, later rules shouldn't have to be looked at.  I'll
 submit a formal feedback form.

 But there is another issue as well.  I do not see how the specified rules
 when applied to the sequence of code points:

 U+0041 U+200D U+0020

 cause the ZWJ, an Extend, to not break with the A, an ALetter.

 Rule WB4 is

 Ignore Format and Extend characters, except when they appear at the
 beginning of a region of text.

 Not clearly stated, but it appears to me that the ZWJ must be considered
 here to be the beginning of a region of text, as we are looking at the
 boundary between it and the A.  No rule specifically mentions ALetter
 followed by an Extend, so by the default rule, WB14

 Otherwise, break everywhere (including around ideographs)

 this should be a word break position.  But that is absurd, as the Extend
 is supposed to extend what precedes it.  If I add a rule

 Don't break before Extend or Format
 × (Extend | Format)

 my implementation passes all tests.  I added this rule before WB4.



 combine the two rules and they are equivalent to these two alternate
 rules:
 WB56 can be read as these two:

   (WB56a) ALetter  ×  (MidLetter | MidNumLet | Single_Quote) (ALetter |
 Hebrew_Letter)

   (WB56b) Hebrew_Letter  ×  (MidLetter | MidNumLet | Single_Quote)
 (ALetter | Hebrew_Letter)


 Then add :

(WB57) Hebrew_Letter ×  Single_Quote

 it just removes the condition of a letter following the quote  in WB56b.
 So that WB56b and WB57 can be read as equivalent to these two:

   (WB56c) Hebrew_Letter  ×  (MidLetter | MidNumLet) (ALetter |
 Hebrew_Letter)

   (WB57) Hebrew_Letter × Single_Quote

 But you cannot merge any of these two last rules in a single rule for
 WB56.


 2015-01-25 7:26 GMT+01:00 Karl Williamson pub...@khwilliamson.com
 mailto:pub...@khwilliamson.com:

 I vaguely recall asking something like this before, but if so, I
 didn't save the answers, and a search of the archives didn't turn up
 anything.

 Some of the rules in UAX #29 don't make sense to me.

 For example, rule WB7a
Hebrew_Letter ×   Single_Quote

 seems to say that a Hebrew_Letter followed by a Single Quote
 shouldn't break.  (And Rule WB4 says that actually there can be
 Extend and Format characters between the two and those should be
 ignored).

 But the earlier rule, WB6

   (ALetter | Hebrew_Letter)  ×   (MidLetter | MidNumLet |
 Single_Quote) (ALetter | Hebrew_Letter)

 seems to me to say (among other things) that a Hebrew Letter
 

Re: (R), (c) and ™

2014-12-18 Thread Mark Davis ☕️
On Thu, Dec 18, 2014 at 11:31 AM, Andrea Giammarchi 
andrea.giammar...@gmail.com wrote:

 standard variant sensitive


It is not clear what you mean by "standard variant sensitive". Can you
elaborate?



Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: (R), (c) and ™

2014-12-18 Thread Mark Davis ☕️
Note that being an emoji ≠ being present in
http://www.unicode.org/Public/UNIDATA/EmojiSources.txt

It would probably be useful to read through
http://www.unicode.org/reports/tr51/, which is where we are working on
various aspects of emoji, in your case especially

   - http://www.unicode.org/reports/tr51/#Identification
   - http://www.unicode.org/reports/tr51/#Presentation_Style

There are charts attached to the TR that can also be reviewed (and
commented on), such as
http://www.unicode.org/Public/emoji/1.0/text-style.html

If you have feedback on the data (either supporting what is there, or
recommending changes), you can submit your feedback via a link to Feedback
(found at the top, and in the review notes for each of the sections).


We haven't yet made firm recommendations on the variation selectors or the
default emoji style, so what is there is a fairly raw draft (but we are
making progress; see https://plus.google.com/+MarkDavis/posts/MLqEc79yN22).

Personally, I think that if a character is in the recommended list for
emoji, then:

   - if the default style is text, we must have variation selectors.
   - if the default style is emoji, then we should have variation selectors
   if it is in common use with a non-emoji presentation (typical for
   characters that have been in Unicode for a long time).



Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Thu, Dec 18, 2014 at 12:09 PM, Andrea Giammarchi 
andrea.giammar...@gmail.com wrote:

 Thanks Mark, I mean not listed anywhere here:
 http://unicode.org/Public/UNIDATA/StandardizedVariants.txt

 I'd expect to find the following there:

 00A9 FE0E; text style;  # COPYRIGHT SIGN
 00A9 FE0F; emoji style; # COPYRIGHT SIGN


 for the simple reason that 00A9 is listed as emoji:
 http://www.unicode.org/Public/UNIDATA/EmojiSources.txt

 Apparently there's no place that says FE0F should affect 00A9, nor a
 place that states the opposite: 00A9 FE0E as text.

 Are my expectations wrong, or should these chars be handled any differently
 from other emoji?

 Thanks


 On Thu, Dec 18, 2014 at 11:03 AM, Mark Davis ☕️ 
 m...@macchiato.com wrote:


 On Thu, Dec 18, 2014 at 11:31 AM, Andrea Giammarchi 
 andrea.giammar...@gmail.com wrote:

 standard variant sensitive


  It is not clear what you mean by "standard variant sensitive". Can you
 elaborate?



 Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: emoji are clearly the current meme fad

2014-12-17 Thread Mark Davis ☕️
We just had a new blog posting; we've moved the media list out of tr51, and
the list already had that item on it. See:

http://www.unicode.org/press/emoji.html#media

Separately, I keep a list of how the media refers to the Unicode
consortium: my favorite is "shadowy emoji overlords."

Bonus points to the first person who can find the one that refers to us as
part of a shameful plot to destroy the institution of marriage...


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Tue, Dec 16, 2014 at 6:36 PM, Asmus Freytag asm...@ix.netcom.com wrote:

  Everybody wants in on the act:

 http://mashable.com/2014/12/12/bill-nye-evolution-emoji/

 A./

 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: emoji are clearly the current meme fad

2014-12-17 Thread Mark Davis ☕️
On Wed, Dec 17, 2014 at 9:03 PM, Murray Sargent 
murr...@exchange.microsoft.com wrote:


 http://www.theguardian.com/commentisfree/2014/nov/28/the-problem-with-emojis


​Bingo, Murray wins the prize!

[image: Inline image 1]​

​Not to open until Christmas...
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: The rapid ... erosion of definition ability

2014-11-17 Thread Mark Davis ☕️
On Mon Nov 17 2014 at 12:15:08 PM Andreas Stötzner a...@signographie.de
wrote:


 Am 17.11.2014 um 11:46 schrieb Leonardo Boiko:

 Sign is too general


 in its generality it is just perfect. The sets of signs in question are
 most general, covering much more matters, objects and topics than the
 actual emoticons.


 They’re just signs and that’s it.

The term 'emoji' is certainly a useful term for people to use, denoting a
certain kind of symbol. Saying that one should never use it is like saying
that one should never say "dog" or "cat", only the generic "animal"...


 The UCS defines the 1F600 set properly as Emoticons. At least, we should
 (in English) speak of Emoticons and not Emoji.


Not really (and we don't really define them as emoticons; that's just the
block name—and arguably it should have been different).



 Other “symbols” (another misnomer i.m.h.o., but that’s another story)


Not, at least, in English.


 of this kind are termed “Miscellaneous Symbols and Pictographs”. This is
 not bad but imprecise as well since many of these signs are not pictographs
 but ideographs.


We warn people in multiple places that the names of blocks are *not*
reliable guides to the kinds of characters in the block.


 Yeah what the heck ;)

 We have a long tradition of naming these things rather lousily (“Dingbats”).
 I am a traditionalist as a matter of fact but if precise terming is tricky
 I find it better to generalize than to blur.


I generally agree about the utility of having generic terms in a language.
Listening to Swiss newscasts, I find it bizarre to hear pretty clumsy
phrasing that is the equivalent of the following (because there is a
different form for male and female of many nouns).

— The politicians(m) and politicians(f) met with the directors(m) and
directors(f), writers(m) and writers(f), and actors(m) and actresses.

We suffer from it much less in English, mostly with "he" and "she",
although clearly the use of "they" as a gender-neutral singular is on the
upswing (although it's been around for centuries).

However, what is most useful is when there are generic terms, *plus*
specific ones.






 ___

 Andreas Stötzner  Gestaltung Signographie Fontentwicklung

 Haus des Buches
 Gerichtsweg 28, Raum 434
 04103 Leipzig
 0176-86823396

 http://stoetzner-gestaltung.prosite.com


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: The rapid ... erosion of definition ability

2014-11-17 Thread Mark Davis ☕️
I agree (except for the derivation of emoji).

On Mon Nov 17 2014 at 11:46:58 AM Leonardo Boiko leobo...@namakajiri.net
wrote:

 "Sign" is too general. The word has no fewer than 12 meanings, and can
 refer, e.g., to many Unicode characters that are not emoji (the sharp
 sign, the less-than sign).[1]

 It's useful to have a specialized word referring specifically to the new
 pictograms used to color electronic messages with emotional inflection.
 Borrowing is a perfectly adequate and natural strategy to get such a word
 into a language – as indeed English did with the word sign, from Old
 French *signe*, from Latin *signum*; and as Japanese did with the English
 word *emotion*, from which the *emo-* in *emoji*, and with Chinese,
 from which *-ji* 'written character'.

 If borrowing words when they're useful is ridiculous, then all languages
 are ridiculous, and when everything is ridiculous nothing is.


 [1] http://en.wiktionary.org/wiki/sign



 2014-11-17 8:09 GMT-02:00 Andreas Stötzner a...@signographie.de:


 Am 17.11.2014 um 08:35 schrieb Mark Davis ☕️:

 IT’S EASY TO DISMISS EMOJI. They are, at first glance, ridiculous


 The only ridiculous thing is to name them “Emoji” outside Japan.
 They’re just signs and that’s it.


 Regards,
 Andreas Stötzner.





 ___

 Andreas Stötzner  Gestaltung Signographie Fontentwicklung

 Haus des Buches
 Gerichtsweg 28, Raum 434
 04103 Leipzig
 0176-86823396

 http://stoetzner-gestaltung.prosite.com



















 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode



___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


The rapid evolution of a wordless tongue

2014-11-16 Thread Mark Davis ☕️
 http://nymag.com/daily/intelligencer/2014/11/emojis-rapid-evolution.html

A more extended article from NY Magazine about the growing usage of emoji,
and the ways in which that usage is developing. Has a quote from Peter
Constable and (indirect) reference to +Steven R. Loomis.

 “IT’S EASY TO DISMISS EMOJI. They are, at first glance, ridiculous. They
are a small invasive cartoon army of faces and vehicles and flags and food
and symbols trying to topple the millennia-long reign of words. Emoji are
intended to illustrate, or in some cases replace altogether, the words we
send each other digitally, whether in a text message, email, or tweet.
Taken together, emoji look like the electronic equivalent of those puffy
stickers tweens used to ornament their Trapper Keepers. And yet...”
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Emoji skin tone modifiers on the website of a leading German daily newspaper

2014-11-08 Thread Mark Davis ☕️
As far as I can tell it is garnering interest all over: from several German
publications, including Spiegel, to French and Italian regional papers, to
Indonesian and Vietnamese ones.

http://www.spiegel.de/netzwelt/web/unicode-consortium-emojis-demnaechst-fuer-alle-hautfarben-a-1001125.html

http://m.baohay.vn/chuyen-de/cong-nghe/961227/Bieu-tuong-Emoji-se-co-mau-da-thay-doi.html

{phone}
On Nov 8, 2014 12:04 AM, Karl Pentzlin karl-pentz...@acssoft.de wrote:

 FYI: On 2014-11-05, a report on Emoji skin tone modifiers was published on
 the website of the Frankfurter Allgemeine, a leading German daily
 newspaper:

 http://www.faz.net/aktuell/gesellschaft/emoticons-smileys-bald-in-fuenf-hautfarben-13249783.html
 - Karl Pentzlin

 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Open Source Emoji for the Web

2014-11-07 Thread Mark Davis ☕️
One can definitely script it; if you hadn't had compat issues it would be
convenient to have the same convention.

On Thu Nov 06 2014 at 11:30:09 PM Andrea Giammarchi 
andrea.giammar...@gmail.com wrote:

 Thanks Mark,
   I will consider this change with CDN chaps too since that would
 invalidate already a lot of cached content at the time it'll ship :-/

 We should have paid more attention, on the other side if you need assets
 locally instead of via CDN a script capable of renaming assets from current
 form to your suggested one seems straight forward to me.

 Would that (sort of) work?

 Thanks



 On Fri, Nov 7, 2014 at 12:18 AM, Mark Davis ☕️ m...@macchiato.com wrote:

 Very nice.

 I'd have one suggestion. People appear to be converging on similar file
 names for the emoji.

- Lowercase hex numbers,
- at least 4 digits,
- otherwise no leading zeros,
- multiple code points separated by _,
- with optional prefix/suffix.

 Like dcm_0030_20e3.png. I'd suggest using that convention.

 Not a big thing, but makes it more consistent in tooling.


 Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*

 On Thu, Nov 6, 2014 at 3:27 PM, Andrea Giammarchi 
 andrea.giammar...@gmail.com wrote:

 I'd like to thank those that helped me a while ago figuring out variants
 and emoji behavior.

 Today we are open sourcing a relatively small JS library and 800+ CDN
 based assets able to bring unified emoji in every WebView capable device
 and browser.

 We are also planning to implement the recently introduced diversity
 for the Unicode 8 draft as soon as we'll figure out a good approach for it
 ( and btw, the default fallback is great! )

 This effort and collaboration is between Twitter [1], MaxCDN [2], and
 Wordpress [3].

 Any comment or suggestion will be more than welcome and appreciated.

 Thanks again and Best Regards

 [1]
 https://blog.twitter.com/2014/open-sourcing-twitter-emoji-for-everyone
 [2] https://www.maxcdn.com/blog/emojis-ftw/
 [3] http://en.blog.wordpress.com/2014/11/06/emoji-everywhere/


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode




___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


keynote

2014-11-06 Thread Mark Davis ☕️
As an experiment, we recorded the keynote at the Unicode Conference. I
posted it at

http://macchiati.blogspot.com/2014/11/unicode-emoji.html

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Open Source Emoji for the Web

2014-11-06 Thread Mark Davis ☕️
Very nice.

I'd have one suggestion. People appear to be converging on similar file
names for the emoji.

   - Lowercase hex numbers,
   - at least 4 digits,
   - otherwise no leading zeros,
   - multiple code points separated by _,
   - with optional prefix/suffix.

Like dcm_0030_20e3.png. I'd suggest using that convention.

Not a big thing, but makes it more consistent in tooling.
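
For tooling, here is a minimal sketch of that convention in plain Java; the
class name is illustrative only (the "dcm_" prefix is from the example
above).

public class EmojiFileName {
    static String fileName(String prefix, String emoji) {
        StringBuilder sb = new StringBuilder(prefix);
        for (int i = 0; i < emoji.length(); ) {
            int cp = emoji.codePointAt(i);
            if (i > 0) sb.append('_');
            // %04x: lowercase hex, padded to at least 4 digits,
            // with no other leading zeros.
            sb.append(String.format("%04x", cp));
            i += Character.charCount(cp);
        }
        return sb.append(".png").toString();
    }

    public static void main(String[] args) {
        System.out.println(fileName("dcm_", "0\u20E3"));  // dcm_0030_20e3.png
        System.out.println(fileName("", "\uD83D\uDE00")); // 1f600.png
    }
}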


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Thu, Nov 6, 2014 at 3:27 PM, Andrea Giammarchi 
andrea.giammar...@gmail.com wrote:

 I'd like to thank those that helped me a while ago figuring out variants
 and emoji behavior.

 Today we are open sourcing a relatively small JS library and 800+ CDN
 based assets able to bring unified emoji in every WebView capable device
 and browser.

 We are also planning to implement the recently introduced diversity for
 the Unicode 8 draft as soon as we'll figure out a good approach for it (
 and btw, the default fallback is great! )

 This effort and collaboration is between Twitter [1], MaxCDN [2], and
 Wordpress [3].

 Any comment or suggestion will be more than welcome and appreciated.

 Thanks again and Best Regards

 [1] https://blog.twitter.com/2014/open-sourcing-twitter-emoji-for-everyone
 [2] https://www.maxcdn.com/blog/emojis-ftw/
 [3] http://en.blog.wordpress.com/2014/11/06/emoji-everywhere/


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Question about a Normalization test

2014-10-23 Thread Mark Davis ☕️
On Thu, Oct 23, 2014 at 6:54 PM, Aaron Cannon 
cann...@fireantproductions.com wrote:

 0061 05AE 0305 0300 0315 0062


http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cu0061+%5Cu05AE+%5Cu0305+%5Cu0300+%5Cu0315+%5Cu0062g=ccc

0305 and 0300 have the same ccc, so the first one blocks the second.

http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G49576

The older spec is shorter, although not as precise:
http://www.unicode.org/reports/tr15/tr15-29.html#Specification
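
A quick way to see the blocking, assuming ICU4J: print the canonical
combining class of each code point in the test sequence (the class name is
illustrative only).

import com.ibm.icu.lang.UCharacter;

public class CccDemo {
    public static void main(String[] args) {
        int[] seq = {0x0061, 0x05AE, 0x0305, 0x0300, 0x0315, 0x0062};
        for (int cp : seq) {
            // 0305 and 0300 both print ccc=230; since 0305 comes first,
            // it blocks 0300 from composing with the base letter.
            System.out.printf("U+%04X ccc=%d%n",
                cp, UCharacter.getCombiningClass(cp));
        }
    }
}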

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


fonts for U7.0 scripts

2014-10-22 Thread Mark Davis ☕️
I'm looking for freely downloadable TTF fonts for any of the following.
I'd appreciate links to sites for any of these:

   1. Bassa_Vah
   2. Duployan
   3. Grantha
   4. Khojki
   5. Khudawadi
   6. Mahajani
   7. Mende_Kikakui
   8. Modi
   9. Mro
   10. Nabataean
   11. Old_Permic
   12. Palmyrene
   13. Pau_Cin_Hau
   14. Tirhuta
   15. Warang_Citi

Coverage doesn't need to be complete, and the font doesn't need to support
shaping (these are just for charts / illustrations).

Mark
https://google.com/+MarkDavis
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: What happened to...?

2014-09-20 Thread Mark Davis ☕️
I agree that we should minute at least some reason for declining. It need
only be a sentence or two.

(BTW I wasn't at that discussion.)

{phone}
On Sep 20, 2014 3:17 AM, Asmus Freytag asm...@ix.netcom.com wrote:

 On 9/19/2014 5:38 PM, Whistler, Ken wrote:

 Michael,

  “Declines to take action” is pretty thin.

 A proposal which is declined by the UTC doesn't automatically
 create an obligation to write an extended dissertation explaining
 the rationale and putting that rationale on record. It might be
 one thing if there were a lot of controversy involved, and one
 group of participants asked for a rationale to be recorded,
 despite not having a consensus to move on something -- but
 this one wasn't even close. Nobody in the committee felt
 encoding was justified in this case.

 And not every mark on paper -- not even every mark *printed*
 in typeset material on paper -- is automatically an obvious
 candidate for encoding with a simple, plain text character
 representation.


 True, but a rationale (note that's not necessarily a dissertation) never
 hurts.

  “Declines to take action” may look like it is equivalent to “Nobody in the
  committee felt encoding was justified in this case”, but it really isn't.
  The former allows for all sorts of non-substantive reasons, but the latter
  is pretty clear: the submitter failed to make the case.

 What you are looking for is something equivalent to summary dismissal of
 a legal action, but even there this usually gets some rationale or it has
 the benefit of a standardized legal principle (don't know for a fact, but
 sounds plausible).



 A./


 --Ken


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode


 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: FYI: Ruble sign in Windows

2014-08-14 Thread Mark Davis ☕️
Cool, congratulations!


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Thu, Aug 14, 2014 at 3:52 PM, Peter Constable peter...@microsoft.com
wrote:

  For those interested, there is an update for Windows available now to
 add font, keyboard and locale data support for the Ruble sign that was
 added in Unicode 7.0. For details, see here:



 http://support.microsoft.com/kb/2970228









 Peter

 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: meaningful and meaningless FE0E

2014-06-29 Thread Mark Davis ☕️
These variation selector characters only apply to specific characters,
those listed in

http://unicode.org/Public/UNIDATA/StandardizedVariants.html

There is a machine-readable version at
http://unicode.org/Public/UNIDATA/StandardizedVariants.txt
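
If you want to check a specific pair programmatically, here is a sketch that
scans the machine-readable file above; the pair tested is illustrative, and
the exact line format is assumed from the published file.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class VariantCheck {
    public static void main(String[] args) throws Exception {
        String wanted = "1F4A9 FE0E"; // pile of poo + VS15 (illustrative)
        URL url = new URL(
            "http://unicode.org/Public/UNIDATA/StandardizedVariants.txt");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"))) {
            // Data lines are assumed to start with the code point sequence,
            // e.g. "0030 FE00; short diagonal stroke form; ...".
            boolean found = in.lines()
                .anyMatch(line -> line.startsWith(wanted + ";"));
            System.out.println(found
                ? wanted + " is a standardized variation sequence"
                : wanted + " is not listed; the selector carries no meaning there");
        }
    }
}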


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Sun, Jun 29, 2014 at 8:47 AM, Andrea Giammarchi 
andrea.giammar...@gmail.com wrote:

 ok, here is the simplified version of my question:

 would U+1F21A followed by U+FE0E be represented differently from how U+1F21A
 is normally rendered?

 is such a sequence even a real concern, or an intent specified anywhere? (no, I
 can't find it; I'm asking just for confirmation)

 Thanks a lot for any outcome!

 Best Regards


 On Sat, Jun 28, 2014 at 10:33 AM, Andrea Giammarchi 
 andrea.giammar...@gmail.com wrote:

 Dear all,
   this is my first email in this channel so apologies in advance if
 already discussed.

 I am trying to understand the expected behavior when there is an unexpected
 VS15 after an emoji that has not been defined, according to this file
 http://www.unicode.org/Public/UNIDATA/NamesList.txt, as VS15-sensitive.

 My take on FE0E is that all emoji that are sensitive to this variant have
 an emojified counterpart that should be used when followed by FE0F and,
 vice-versa, a textual counterpart when followed by FE0E; but all other emoji
 should not consider such a variant at all, since there's no textual
 counterpart to represent, let's say, a 1F4A9 pile-of-poo

 \ud83d\udca9\ufe0e

 Can anyone please confirm my expectations are correct: that the above
 sequence in both Java and JavaScript will show the poo emoji regardless,
 followed by an FE0E variant that will be simply ignored, and that actually no
 device/OS/renderer/viewer/browser would ever create such a sequence, so the
 problem I am trying to solve is actually a non-problem?

 Thanks in advance and Best Regards



 ___
 Unicode mailing list
 Unicode@unicode.org
 http://unicode.org/mailman/listinfo/unicode


___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Re: Swift

2014-06-05 Thread Mark Davis ☕️
I haven't done any analysis, but on first glance it looks like it is based
on

http://www.unicode.org/reports/tr31/#Alternative_Identifier_Syntax


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Thu, Jun 5, 2014 at 5:46 PM, Jeff Senn s...@maya.com wrote:

 Has anyone figured out whether character sequences that are non-canonical
 (de)compositions but could be recomposed to the same result
 are the same identifier or not?

 That is: are identifiers merely sequences of characters or intended to be
 comparable as “Unicode strings” (under some sort of compatibility rule)?

 On Jun 5, 2014, at 11:27 AM, Martin v. Löwis mar...@v.loewis.de wrote:

  Am 04.06.14 11:28, schrieb Andre Schappo:
  The restrictions seem a little like IDNA2008. Anyone have links to
  info giving a detailed explanation/tabulation of allowed and non
  allowed Unicode chars for Swift Variable and Constant names?
 
  The language reference is at
 
 
 https://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html
 
  For reference, the definition of identifier-character is (read each
  line as an alternative)
 
  identifier-character → Digit 0 through 9
  identifier-character → U+0300–U+036F, U+1DC0–U+1DFF, U+20D0–U+20FF, or
  U+FE20–U+FE2F
  identifier-character → identifier-head­
 
  where identifier-head is
 
  identifier-head → Upper- or lowercase letter A through Z
  identifier-head → U+00A8, U+00AA, U+00AD, U+00AF, U+00B2–U+00B5, or
  U+00B7–U+00BA
  identifier-head → U+00BC–U+00BE, U+00C0–U+00D6, U+00D8–U+00F6, or
  U+00F8–U+00FF
  identifier-head → U+0100–U+02FF, U+0370–U+167F, U+1681–U+180D, or
  U+180F–U+1DBF
  identifier-head → U+1E00–U+1FFF
  identifier-head → U+200B–U+200D, U+202A–U+202E, U+203F–U+2040, U+2054,
  or U+2060–U+206F
  identifier-head → U+2070–U+20CF, U+2100–U+218F, U+2460–U+24FF, or
  U+2776–U+2793
  identifier-head → U+2C00–U+2DFF or U+2E80–U+2FFF
  identifier-head → U+3004–U+3007, U+3021–U+302F, U+3031–U+303F, or
  U+3040–U+D7FF
  identifier-head → U+F900–U+FD3D, U+FD40–U+FDCF, U+FDF0–U+FE1F, or
  U+FE30–U+FE44
  identifier-head → U+FE47–U+FFFD
  identifier-head → U+10000–U+1FFFD, U+20000–U+2FFFD, U+30000–U+3FFFD, or
  U+40000–U+4FFFD
  identifier-head → U+50000–U+5FFFD, U+60000–U+6FFFD, U+70000–U+7FFFD, or
  U+80000–U+8FFFD
  identifier-head → U+90000–U+9FFFD, U+A0000–U+AFFFD, U+B0000–U+BFFFD, or
  U+C0000–U+CFFFD
  identifier-head → U+D0000–U+DFFFD or U+E0000–U+EFFFD
 
  As the construction principle for this list, they say
 
  Identifiers begin with an upper case or lower case letter A through Z,
  an underscore (_), a noncombining alphanumeric Unicode character in the
  Basic Multilingual Plane, or a character outside the Basic Multilingual
  Plane that isn’t in a Private Use Area. After the first character, digits
  and combining Unicode characters are also allowed.
 
  Regards,
  Martin


Re: Swift

2014-06-04 Thread Mark Davis ☕️
Apparently you can use emoji in the identifiers. 

(
http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/
)


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Wed, Jun 4, 2014 at 11:28 AM, Andre Schappo a.scha...@lboro.ac.uk
wrote:

 Swift is Apple's new programming language. In Swift, variable and constant
 names can be constructed from Unicode characters. Here are a couple of
 examples from Apple's doc
 http://developer.apple.com/library/prerelease/ios/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html

 let π = 3.14159
 let 你好 = "你好世界"

 I think this a huge step forward for i18n and Unicode.

 There are some restrictions on which Unicode chars can be used. From
 Apple's doc

 Constant and variable names cannot contain mathematical symbols, arrows,
 private-use (or invalid) Unicode code points, or line- and box-drawing
 characters. Nor can they begin with a number, although numbers may be
 included elsewhere within the name.

 The restrictions seem a little like IDNA2008. Anyone have links to info
 giving a detailed explanation/tabulation of allowed and non allowed Unicode
 chars for Swift Variable and Constant names?

 André Schappo




Re: Corrigendum #9

2014-06-03 Thread Mark Davis ☕️
On Mon, Jun 2, 2014 at 10:32 PM, David Starner prosfil...@gmail.com wrote:

 Why? It seems you're changing the rules
 ​...


This isn't "are changing", it is "has changed". The Corrigendum was issued
at the start of 2013, about 16 months ago; applicable to all relevant
earlier versions. It was the result of fairly extensive debate inside the
UTC; there hasn't been a single issue on this thread that wasn't considered
during the discussions there. And as far back as 2001, the UTC made it
clear that noncharacters *are* scalar values, and are to be converted by
UTF converters. Eg, see
http://www.unicode.org/mail-arch/unicode-ml/y2001-m09/0149.html (by chance,
one day before 9/11).

 probably trigger serious bugs in some lamebrained utility.

There were already plenty of programs that passed the noncharacters
through; very few would filter them (some would delete them, which is
horrible for security). Thinking that a utility would never encounter them
in input text was a pipe-dream. If a utility or library is so fragile that
it *breaks* on input of any valid UTF sequence, then it *is* a lamebrained
utility. A good unit test for any production chain would be to check there
is no crash on any input scalar value (and for that matter, any ill-formed
UTF text).
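
Such a check is only a few lines; here is a minimal sketch, assuming the
standard java.nio converters (the class name is invented):

import java.nio.charset.StandardCharsets;

public class RoundTripCheck {
    public static void main(String[] args) {
        String nonchar = "\uFFFF"; // a noncharacter, but a valid scalar value
        byte[] utf8 = nonchar.getBytes(StandardCharsets.UTF_8);
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(nonchar.equals(back)); // true: converted, not rejected
    }
}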


Re: Corrigendum #9

2014-06-03 Thread Mark Davis ☕️
On Tue, Jun 3, 2014 at 9:41 AM, David Starner prosfil...@gmail.com wrote:

 Thinking that a utility would never mangle them if encountered in
 input text was a pipe-dream.


I didn't say "not mangle", I said "break", as in crash.

I don't think this thread is going anywhere productive, so I'm signing
off from it.


Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Mark Davis ☕️
 \uD808\uDF45 specifies a sequence of two codepoints.

That is simply incorrect.

In Java (and similar environments), \uXXXX means a char (a UTF-16 code
unit), not a code point. Here is the difference. If you are not used to
Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x
with the replacement y in string. Backslashes in literals need escaping, so
\x needs to be written in literals as \\x.

String[] tests = {"\\x{12345}", "\\uD808\\uDF45", "\uD808\uDF45",
    "«.»"};
String target =
    "one: «\uD808\uDF45»\t\t" +
    "two: «\uD808\uDF45\uD808\uDF45»\t\t" +
    "lead: «\uD808»\t\t" +
    "trail: «\uDF45»\t\t" +
    "one+: «\uD808\uDF45\uD808»";
System.out.println("pattern" + "\t→\t" + target + "\n");
for (String test : tests) {
  System.out.println(test + "\t→\t" + target.replaceAll(test, "§︎"));
}


*​Output:*
pattern → one: «𒍅» two: «𒍅𒍅» lead: «?» trail: «?» one+: «𒍅?»

\x{12345} → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
\uD808\uDF45 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
𒍅 → one: «§︎» two: «§︎§︎» lead: «?» trail: «?» one+: «§︎?»
«.» → one: §︎ two: «𒍅𒍅» lead: §︎ trail: §︎ one+: «𒍅?»

The target has various combinations of code units, to see what happens.
Notice that Java treats a pair of lead+trail as a single code point for
matching (e.g., «.»), but also an isolated surrogate char as a single code point
(last line of output). Note that Java's regex in addition allows \x{hex}
for specifying a code point explicitly. It also has the syntax \u (in a
literal the \ needs escaping) to specify a code unit; that is slightly
different than the Java preprocessing. Thus the first two are equivalent,
and replace { by x. The last two are also equivalent—and fail—because a
single { is a broken regex pattern.

System.out.println("{".replaceAll("\\u007B", "x"));
System.out.println("{".replaceAll("\\x{7B}", "x"));

System.out.println("{".replaceAll("\u007B", "x"));
System.out.println("{".replaceAll("{", "x"));



Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Sun, Jun 1, 2014 at 7:04 PM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 On Sun, 1 Jun 2014 08:58:26 -0700
 Markus Scherer markus@gmail.com wrote:

  You misunderstand. In Java, \uD808\uDF45 is the only way to escape a
  supplementary code point, but as long as you have a surrogate pair,
  it is treated as a code point in APIs that support them.

 Wasn't obvious that in the following paragraph \uD808\uDF45 was a
 pattern?

 Bear in mind that a pattern \uD808 shall not match anything in a
 well-formed Unicode string. \uD808\uDF45 specifies a sequence of two
 codepoints. This sequence can occur in an ill-formed UTF-32 Unicode
 string and before Unicode 5.2 could readily be taken to occur in an
 ill-formed UTF-8 Unicode string.  RL1.7 declares that for a regular
 expression engine, the codepoint sequence U+D808, U+DF45 cannot
 occur in a UTF-16 Unicode string; instead, the code unit sequence D808
 DF45 is the codepoint sequence U+12345 CUNEIFORM SIGN URU TIMES
 KI.

 (It might have been clearer to you if I'd said '8-bit' and '16-bit'
 instead of UTF-8 and UTF-16.  It does make me wonder what you'd call a
 16-bit encoding of arbitrary *codepoint* sequences.)

 Richard.


Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
The problem is where to draw the line. In today's world, what's an app? You
may have a cooperating system of apps, where it is perfectly reasonable
to interchange sentinel values (for example).

I agree with Markus; I think the FAQ is pretty clear. (And if not, that's
where we should make it clearer.)


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Mon, Jun 2, 2014 at 6:02 PM, Shawn Steele shawn.ste...@microsoft.com
wrote:

  I also think that the verbiage swung too far the other way.  Sure, I
 might need to save or transmit a file to talk to myself later, but apps
 should be strongly discouraged from using these for interchange with other
 apps.



 Interchange bugs are why nearly any news web site ends up with at least a
 few articles with mangled apostrophes or whatever (because of encoding
 differences).  Should authors’ tools or feeds or databases or whatever
 start emitting non-characters from internal use, then we’re going to have
 ugly leak into text “everywhere”.



 So I’d prefer to see text that better permitted interchange with other
 components of an application’s internal system or partner system, yet
 discouraged use for interchange with “foreign” apps.



 -Shawn





Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele shawn.ste...@microsoft.com
wrote:

 The “problem” is now that previously these characters were illegal


The problem was that we were inconsistent in the standard and related material
about just what the status was for these things.



Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
I disagree with that characterization, of course.

The recommendation for libraries and low-level tools to pass them through
rather than screw with them makes them usable. The recommendation to check
for noncharacters from unknown sources and fix them was good advice then,
and is good advice now. Any app where input of noncharacters causes
security problems or crashes is and was not a very good app.


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Mon, Jun 2, 2014 at 6:37 PM, Asmus Freytag asm...@ix.netcom.com wrote:

  On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:


 On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele shawn.ste...@microsoft.com
 wrote:

 The “problem” is now that previously these characters were illegal


  The problem was that we were inconsistent in the standard and related
 material about just what the status was for these things.


   And threw the baby out to fix it.

 A./


  Mark https://google.com/+MarkDavis

  *— Il meglio è l’inimico del bene —*




Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-05-31 Thread Mark Davis ☕️
I think you have a point here. We should probably change to:

To meet this requirement, an implementation shall supply a mechanism for
specifying any Unicode scalar value (from U+0000 to U+D7FF and U+E000 to
U+10FFFF), using the hexadecimal code point representation.

and then in the notes say that the same notation can be used for codepoints
that are not scalar values, for implementations that handle them in Unicode
strings.
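
For comparison, Java's regex engine already accepts this notation; a sketch
using java.util.regex:

import java.util.regex.Pattern;

public class HexCodePoint {
    public static void main(String[] args) {
        String s = "\uD801\uDC00"; // U+10400 DESERET CAPITAL LETTER LONG I
        // \x{...} specifies a scalar value; the surrogate pair matches as one code point.
        System.out.println(Pattern.compile("\\x{10400}").matcher(s).matches()); // true
        System.out.println(Pattern.compile(".").matcher(s).matches());          // true
    }
}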


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


On Fri, May 30, 2014 at 8:45 PM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 Is there any good reason for UTS#18 'Unicode Regular Expressions' to
 express its requirements in terms of codepoints rather than scalar
 values?

 I was initially worried by RL1.1 requiring that one be able to specify
 surrogate codepoints in a pattern.  It would not be compliant for an
 application to reject such patterns as syntactically or semantically
 incorrect!  RL1.1 seemed to prohibit compliant regular expression
 engines that only handled well-formed UTF-8 strings.

 Furthermore, consider attempting to handle CESU-8 text as a sequence of
 UTF-8 code units.  The code unit sequence for U+10000 will,
 corresponding to the UTF-16 code unit sequence D800 DC00, be ED A0 80
 ED B0 80. If one follows the lead of the 'best practice' for processing
 ill-formed UTF-8 code unit sequences given in TUS Section 5.22, this
 will be interpreted as *four* ill-formed sequences, ED A0, 80, ED B0,
 and 80.  I am not aware of any recommendation as to how to interpret
 these sequences as codepoints.

 While being able to specify a search for surrogate codepoint U+D800
 might be useful when dealing with ill-formed UTF-16 Unicode sequences,
 UTS#18 Section 1.7, which discusses requirement RL1.7, states that there
 is no requirement for a one-codepoint pattern such as \u{D800} to match
 a UTF-16 Unicode string consisting just of one code unit with the value
 0xD800.  The convenient, possibly intended, consequence of this is that
 the RL1.1 requirement to allow patterns to specify surrogate codepoints
 can be satisfied by simply treating them as unmatchable; for example,
 such a 1-character RE could be treated as the empty Unicode set
 [\p{gc=Lo} & \p{gc=Mn}].

 Now, I suppose one might want to specify a match for ill-formed (in
 context) UTF-8 code unit subsequences such as E0 80 (not a valid
 initial subsequence) and E0 A5 (lacking a trailing byte), but as
 matching is not required, I don't see the point in UTS#18 being
 changed to ask for an appropriate syntax to be added.

 Richard.


Re: Long-Encoded Restricted Characters in High Frequency Modern Use

2014-05-31 Thread Mark Davis ☕️
Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Fri, May 30, 2014 at 12:39 AM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 I am a little confused by the call for a review of UTS #39, Unicode
 Security Mechanisms (PRI #273).  Are we being requested to
 report long-encoded 'restricted' characters in high frequency modern
 use?  'Restricted' refers to the classification in
 xidmodifications.txt.


First, "restricted" characters are meant not for everyday use, but
specifically for the purpose of programming identifiers and similar sorts of
identifiers. Moreover, it sets up a framework, but the conformance
requirements are only that any modification is declared.

http://www.unicode.org/reports/tr39/proposed.html#C1

You may know this all, but just to be sure.

​


 One linked pair of long-encoded restricted characters in high frequency
 use is U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM,
 which occurs in the extremely common Thai and Lao words for 'water' or
 'liquid in general' น้ำ ນ້ຳ whose NFKC decompositions are the
 nonsensical forms น้ํา ນ້ໍາ, but may be faked by the linguistically
 incorrect นํ้า ນໍ້າ.  In Thai the encodings are U+0E19 THAI CHARACTER
 NO NU, U+0E49 THAI CHARACTER MAI THO, U+0E33 THAI CHARACTER SARA AM,
 U+0E19, U+0E49, U+0E4D THAI CHARACTER NIKHAHIT, U+0E32 THAI CHARACTER
 SARA AA and U+0E19, U+0E49, U+0E4D, U+0E49, U+0E32.


The structure of the data is based on the use of NFKC characters in
identifiers. So SARA AM and the Lao equivalent are both not NFKC
characters, are categorized as such, and would need to be represented
by their NFKC forms. The process is in
http://www.unicode.org/reports/tr39/proposed.html#IDMOD_Data_Collection

You can see the categorization (for 6.3) for a whole script with a link
like:

http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=thai}

(It only works for 6.3 right now, but these items haven't changed recently.)
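
The SARA AM case is quick to verify with the stock JDK normalizer (a
sketch; ICU gives the same result):

import java.text.Normalizer;

public class SaraAmCheck {
    public static void main(String[] args) {
        // U+0E33 THAI CHARACTER SARA AM decomposes under NFKC to
        // U+0E4D THAI CHARACTER NIKHAHIT + U+0E32 THAI CHARACTER SARA AA.
        String nfkc = Normalizer.normalize("\u0E33", Normalizer.Form.NFKC);
        nfkc.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        // Prints: U+0E4D U+0E32
    }
}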



 Now, U+0E4D THAI
 CHARACTER NIKHAHIT is classified as 'allowed; recommended', although
 its main use is in writing Pali, which would suggest that it should be
 'restricted; historic' or 'restricted; limited-use'.


For that, it would be best to submit via
http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a
feedback form at http://www.unicode.org/reporting.html, just to be sure.
​

 The situation is
 not so clear for Lao
 - U+0ECD LAO NIGGAHITA is a fairly common vowel in the Lao language.




Based on your information, the following appear (at least to me) to be
caused by typos in the xidmodifications source files; they are all
marked as 'technical'.

http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=khmer}

Again, best to submit this like above (via
http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a
feedback form at http://www.unicode.org/reporting.html).


 To me, a truly bizarre set of 'restricted' characters is U+17CB KHMER
 SIGN BANTOC to U+17D0 KHMER SIGN SAMYOK SANNYA, which are categorised as
 'restricted; technical'. They are all in use in the Khmer language.

 U+17CB KHMER SIGN BANTOC is required for the main methods of writing
 the Khmer vowels /a/ and /ɑ/.

 U+17CC KHMER SIGN ROBAT is a repha, but I would be surprised to learn
 that it has recently become little-used.  It is, however, readily
 confused with U+17CD KHMER SIGN TOANDAKHIAT, a 'pure killer' whose main
 modern use is to show that a consonant is silent, rather like the Thai
 letter U+0E4C THAI CHARACTER THANTHAKHAT.  (The names are the same.)
 The confusion arises because Sanskrit -rCa was pronounced /-r/ in
 Khmer, and final /r/ recently became silent in Khmer, so the effect of
 the Sanskrit /r/ is now to silence the final consonant.

 While U+17CE KHMER SIGN KAKABAT and U+17CF KHMER SIGN AHSDA may not be
 common, they are still in modern use.

 Although U+17D0 KHMER SIGN SAMYOK SANNYA may have declined in
 frequency, it has not dropped out of use and is still a common enough
 way of writing the vowel /a/.




 Richard.



Re: Corrigendum #9

2014-05-31 Thread Mark Davis ☕️
A few quick items. (I admit to only skimming your response, Philippe; there
is only so much time in the day.)

Any discussion of changing non-characters is really pointless. See
http://www.unicode.org/policies/property_value_stability_table.html

As to breaking up the block, that is not forbidden: but one would have to
give pretty compelling arguments that the benefits would outweigh any
likely problems, especially since we already don't recommend the use of the
block property in regexes.

 And regular expressions trying to use character properties have many more
caveats to handle (the most serious being with canonical equivalences and
discontinuous matches or partial matches).

The UTC, after quite a bit of work, concluded that it was not feasible with
today's regex engines to handle normalization automatically, instead
recommending the approach in
http://www.unicode.org/reports/tr18/#Canonical_Equivalents

 Regexps are still a very experimental proposal, they are still very
difficult to make interoperable except in a small set of tested cases

I have no idea where this is coming from. Regexes using Unicode properties
are in widespread and successful use. It is not that hard to make them
interoperable (as long as both implementations are using the same version
of Unicode).


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Sat, May 31, 2014 at 9:36 PM, Philippe Verdy verd...@wanadoo.fr wrote:

 Maybe; but there's real doubt that a regular expression that would need
 this property would be severely broken if that property was corrected.
 There are many other properties that are more useful (and much more used)
 whose associated set of codepoints changes regularly across versions.

 I don't see any specific interest in maintaining non-characters in that
 block, as it effectively reduces the reusability of this property.
 And in fact it would be highly preferable to no longer state that these
 non-characters in Arabic Presentation Forms be treated like C1 controls or
 PUA (because they will never be reassigned to something more useful). Making
 them PUA would not radically change the fact that these characters are not
 recommended, but we would no longer bother about checking whether they are
 valid or not. They remain there only as a legacy of old outdated versions of
 Unicode, for a mysterious need that I've not clearly identified.

 Let's assume we change them into PUA; some applications will start
 accepting them while some others won't. Not a problem, given that they are
 already not interoperable.

 And regular expressions trying to use character properties have many more
 caveats to handle (the most serious being with canonical equivalences and
 discontinuous matches or partial matches): when searches are only focusing
 on exact sets of code points instead of sets of canonically equivalent
 texts; the other complication coming with the effect of collation and its
 variable strength, matching more or less of the text spanning ignorable
 collation elements, i.e., possibly also discontinuous runs of ignorable
 codepoints, if we want to get consistent results independent of the
 normalization form. More complicated is how to handle partial matches, such
 as a combining character within a precomposed character which is canonically
 equivalent to a string where this combining character appears.

 And even more tricky is how to handle substitution with regexps, for
 example when performing a search at the primary collation level, ignoring
 lettercase, but when we want to replace base letters yet preserve case in
 the substituted string: this requires specific lookup of characters using
 properties **not** specified in the UCD but in the collation tailoring
 data, and then how to ensure that the result of the substitution in the
 plain-text source will remain a valid text not creating new unexpected
 canonical equivalences, that it will also not break basic orthographic
 properties such as syllabic structures in a specific pair of
 language+script, and that it will not produce unexpected collation
 equivalents at the same collation strength (causing later unexpected
 never-ending loops of substitutions, for example in large websites with bots
 performing text corrections).

 Regexps are still a very experimental proposal; they are still very
 difficult to make interoperable except in a small set of tested cases, and
 for this reason I really doubt that the character encoding-block
 property is very productive for now with regexps (and notably not with this
 compatibility block, whose characters will remain used in isolation,
 independently of their context, if they are still used in rare cases).

 I see little value in keeping this old complication in this block, but
 just more interoperability problems for implementations. So these
 non-characters should be treated mostly like PUA, except that they have a few
 more properties: direction=RTL, script=Arabic, and starters working in
 isolation for the Arabic 

Re: Unicode Sets in 'Unicode Regular Expressions'

2014-05-27 Thread Mark Davis ☕️
They are defined in http://unicode.org/reports/tr35/tr35.html#Unicode_Sets.
We should add a pointer to that; could you please file a feedback report
for #18 to that effect?

Also, if you find any problems in the description in #35, you can file a
ticket at http://unicode.org/cldr/trac/newticket to get them addressed.


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Wed, May 28, 2014 at 12:18 AM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 UTS#18 'Unicode Regular Expressions' Version 17 Requirement RL1.3
 'Subtraction and Intersection' talks of Unicode sets.  What is the
 relevant definition of a 'Unicode set'? Is it a finite set of non-empty
 strings?  Other possibilities that occur to me, depending on context,
 include sets of codepoints and sets of indecomposable codepoints.

 Richard.


Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Mark Davis ☕️
On 25 April 2014 20:53, Karl Williamson pub...@khwilliamson.com wrote:

 And in fact in some Unicode releases, they contained errors.


I think you know this, but for others.

A derived property value in the UCD is defined by the value in the derived
data file, NOT by the derivation. Of course, the value might not follow
the intent, just as with any other property, and there are fixes to
properties, whether derived or not, in each release. And sometimes the
statement of the derivation is changed, and sometimes property values are
changed.

And the regex recommendations in
http://www.unicode.org/reports/tr18/#Compatibility_Properties are
different, so you may be referring to them.

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


Re: Unclear text in the UBA (UAX#9) of Unicode 6.3

2014-04-22 Thread Mark Davis ☕️
We try not to do that. There are some known holes, like RBNF. If you know
of others, please file a ticket.

{phone}
On Apr 21, 2014 9:18 PM, Doug Ewell d...@ewellic.org wrote:

 From: Asmus Freytag asmusf at ix dot netcom dot com wrote:

  In general, I heartily dislike specifications that just narrate a
  particular implementation...

 I agree completely. I see this with CLDR as well; there is a more or
 less implicit assumption that I will be using ICU to implement whatever
 is being described. I don't care how robust and well-tested a wheel is;
 as a developer, I should be able to use the specification to reinvent it
 if I like.

 --
 Doug Ewell | Thornton, CO, USA
 http://ewellic.org | @DougEwell




Re: Updated emoji working draft

2014-04-15 Thread Mark Davis ☕️
On 15 April 2014 13:14, William_J_G Overington wjgo_10...@btinternet.com wrote:

 If the UTC (Unicode Technical Committee) accepts the introduction of
 read-out labels, each read-out label both linked to a pictograph character
 and also linked to a language-localization text string, then that will be a
 far-reaching enhancement to Unicode which may have enormous implications
 for facilitating communication through the language barrier.


 If the UTC (Unicode Technical Committee) accepts the introduction of
read-out labels

The passage just points out that those can exist, the document does not
provide any data for that.

 If there were on the webpage emoji for Surname, Forename, Delivery
address, Card number

I can't see any possible future in which emoji like that are encoded.

As I said before, please move this discussion to another email subject.
Otherwise, I'll take a step I should have long ago, and simply filter out
all email coming from you.

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


Re: Updated emoji working draft

2014-04-14 Thread Mark Davis ☕️
This is really off topic. If you want to start up a thread about this,
please use a different subject.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


On 14 April 2014 16:01, William_J_G Overington wjgo_10...@btinternet.com wrote:

 Here are two examples each of a symbol together with accompanying text in
 Venice.

 The symbol is global and the text is local.


 https://maps.google.com/maps?q=Venice,+Italy&hl=en&ll=45.432399,12.337928&spn=0.000702,0.001124&sll=37.0625,-95.677068&sspn=26.039016,36.826172&oq=venice&hnear=Venice,+Veneto,+Italy&t=m&layer=c&cbll=45.432473,12.337638&panoid=YazHmOmqVm1q5CZ2H7klMQ&cbp=12,16.36,,0,8.23&z=19

 Going full screen and zooming-in is helpful.

 William Overington

 14 April 2014



Re: Updated emoji working draft

2014-04-12 Thread Mark Davis ☕️
On 12 April 2014 11:46, William_J_G Overington wjgo_10...@btinternet.com wrote:
​...​

In March 2014 I published the attached document, depositing a copy with the
 British Library.


 The_format_of_the_translit.dat_file_suggested_for_possible_use_for_transliteration.pdf

 Is this format suitable to become standardized for use in producing
 localized text-to-speech from emoji to the chosen local language?

​
No, not particularly.




Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


Re: Updated emoji working draft

2014-04-12 Thread Mark Davis ☕️
On 12 April 2014 16:54, William_J_G Overington wjgo_10...@btinternet.com wrote:

 Would it be good, for an emoji that is not encoded in regular Unicode, to
 include mention of the possibility of transmission by markup bubble,
 rendered upon reception as an unmapped glyph by an OpenType colour font?

 For example, as nine Unicode characters.

 COLON COLON U1 U2 U3 U4 U5 COLON SEMICOLON

 This would perhaps not always allow new emoji to be added as quickly as
 with embedded graphics, yet with this technique, the message could be
 archived as plain text and would be searchable and text-to-speech would be
 possible at the receiving end.


I don't think anything like what you suggest would be feasible, or
desirable.

Longer term, I think the most feasible approach is the interchange of
embedded graphics, which can always have alt values (at least in html) for
readings.


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


Re: Bidi reordering of soft hyphen

2014-04-02 Thread Mark Davis ☕️
I tend to agree with Roozbeh and Behdad. I would expect to find the visible
appearance of the hyphen replacing the letters that were broken off from
the last word. That is, if the word was “beekeeper”, I'd expect to see:

 “bee-”.

That would be no matter where the word occurred, and no matter what the
direction of the paragraph or surrounding text. (If the SHY occurred at a
directional boundary, I'd also say we don't care much...)

In any event, once we come up with an agreed recommendation, I'd suggest an
implementation note like Asmus describes, but rather than talk about
algorithmic steps, just point out the desired visual behavior (since there
are many ways to do it).



Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On 1 April 2014 23:43, Asmus Freytag asm...@ix.netcom.com wrote:

  I think this calls for an implementation note on UAX#9 along these lines.
 -
 During line breaking, if a line is broken at the location of a SHY, the
 text around the line break may change. A common case is the replacement of
 the invisible SHY by a visible HYPHEN, but see Section x.x in the Unicode
 Standard.

 For the purposes of the Bidi Algorithm, apply steps .. to .. after any
 substitutions have been made, using the directional classes for the
 substituted characters, instead of a single BN for the SHY character.

 <example>

 Note, no special action need be taken for a SHY character in the middle of
 a line, unless they are rendered as visible glyphs in a “show hidden
 character” mode. In the latter case, the recommendation would be to treat
 the visible symbol substituted for the SHY as having bidi class ON.
 

 I am not sure whether “-car CBA” or “car- CBA” is the right answer, nor
 whether the substitution will always be limited to the preceding line. (Old
 orthography German had Bäc<SHY>ker turning into Bäk-|ker, where I've used
 “|” to show the line ending.) Those are details that the UBA should be
 ignorant about. The important thing is that the array of bidi directional
 classes is not constrained to contain a single entry for BN at the location
 of the original SHY.

 If “car- CBA” is the right answer, then the substitution would have to be
 HYPHEN plus LRM to get this to come out right, but that would be under the
 control of the line-breaking conventions, and not legislated by the UBA.

 A./


 On 4/1/2014 1:31 PM, Whistler, Ken wrote:

  Richard Wordingham noted:



  As U+2010 HYPHEN would result in text like 'car-', in an English

  influenced context I would also go with 'car-'.



 That's always a possibility, I suppose, but I'm not sure what

 English influenced context means here.



 The examples I just gave were for a RTL paragraph context.

 In a LTR paragraph context, the same input would end up in

 a very different order:



 Trace: Entering br_UBA_ReverseLevels [L2]

 Current State: 19

   Text:05D0 05D1 05D2 0020 0063 0061 0072 002D

   Bidi_Class: RRRLLLLL

   Levels: 11100000

   Runs:L---L



   Order:  [2 1 0 3 4 5 6 7]



 And you get the display:



 CBA car-

 -



 As opposed to:



 -car CBA

 -



 In either case, the hyphen-minus (or hyphen), ends up at the *end of the
 line*.



 My take is that *if* I am going to insert a visible glyph at the point of
 the

 SHY, it would probably be best to insert it at the actual line break at the

 end of the line, to be in the same position as an explicit hyphen-minus
 with

 the same line break.



 --Ken








FYI: More emoji from Chrome

2014-04-01 Thread Mark Davis ☕️
More emoji from Chrome:

http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html

with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y


Re: FYI: More emoji from Chrome

2014-04-01 Thread Mark Davis ☕️
Yup!


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On 1 April 2014 09:13, Philippe Verdy verd...@wanadoo.fr wrote:

 April 1st joke...


 2014-04-01 9:01 GMT+02:00 Mark Davis ☕️ m...@macchiato.com:

 More emoji from Chrome:

 http://chrome.blogspot.ch/2014/04/a-faster-mobiler-web-with-emoji.html

 with video: https://www.youtube.com/watch?v=G3NXNnoGr3Y



Re: Names for control characters (Was: (in 6429) in allkeys.txt)

2014-03-12 Thread Mark Davis
They do have aliases in NameAliases.txt

0000;NULL;control

0000;NUL;abbreviation

0001;START OF HEADING;control

0001;SOH;abbreviation

0002;START OF TEXT;control

0002;STX;abbreviation

...


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*


On Wed, Mar 12, 2014 at 1:32 PM, Per Starbäck starb...@stp.lingfil.uu.se wrote:

 Ken Whistler wrote:
  Ah, I see what the interpretation problem was. Yes, that is
  a straightforward kind of improvement -- easily enough done.
  Look for a change the next time the file is updated. (It will not
  be immediately changed, pending other review comments.)

 Thanks! Then I'll skip making a formal request about this.

 Regarding these names in ISO 6429 again, how come these control
 characters don't have Unicode names? For many uses of names, the control
 characters have as much need for them as any other character.
 Since it seems so straightforward it must have been suggested several
 times to introduce names like

   CONTROL CHARACTER NULL
   CONTROL CHARACTER START OF HEADING
   CONTROL CHARACTER START OF TEXT

 etc., so I assume there are good reasons for not doing that, but I can't
 see what they are.

 Since applications want names they will use other things as names when
 there isn't a real name, and that leads to problems. Take Emacs where
 the command describe-char currently describes U+0007 as

   name: control
   old-name: BELL

 (I reported the misusage of control here as a name in 2009, but it
 wasn't fixed until this year, so still not in a released version.)
 The usage of BELL here invites confusion with U+1F514 BELL.

 Emacs should do better regarding this, but still, with a proper name
 all of this would have been averted.


Re: NFD - NFC

2014-03-11 Thread Mark Davis
Not sure about your exact case, but ICU's normalization does handle those
characters.

http://unicode.org/cldr/utility/transform.jsp?a=nfc%3Bhex&b=%5Cu30B9%5Cu3099

(That tool uses ICU for NFC).
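
The same composition can be checked with the JDK's built-in normalizer (a
sketch, assuming java.text.Normalizer rather than ICU):

import java.text.Normalizer;

public class KatakanaNfc {
    public static void main(String[] args) {
        // U+30B9 KATAKANA LETTER SU + U+3099 COMBINING VOICED SOUND MARK
        String nfc = Normalizer.normalize("\u30B9\u3099", Normalizer.Form.NFC);
        System.out.println(nfc.equals("\u30BA")); // true: composes to KATAKANA ZU
    }
}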


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


On Tue, Mar 11, 2014 at 4:50 PM, Markus Doppelbauer doppelba...@gmx.net wrote:

 Hello,

 I have another problem making the normalization process binary
 compatible with ICU.
  Why does 30B9 3099 not combine to 30BA?

 Steps to reproduce:
  wget http://doppelbauer.name/katakana.txt
 uconv -f utf8 -t utf8 -x nfd katakana.txt > ndf.txt
 uconv -f utf8 -t utf8 -x nfc ndf.txt > nfc.txt
 diff katakana.txt nfc.txt

  Expected result: katakana.txt == nfc.txt

 uconv v2.1  ICU 4.8.1.1

 Thanks a lot
 Markus





Re: Unicode organization is still anti-Serbian and anti-Macedonian

2014-02-14 Thread Mark Davis
Unicode is not anti-Serbian or anti-Macedonian.

The exact level of Unicode support will depend on your operating system and
font choice. For example, on the Mac there are reasonable results with
arbitrary
accents. Here are examples with q,U+0308 and Q,U+0308

q̈

Q̈

Here is an image, in case your emailer or OS doesn't handle these well.

[image: Inline image 1]
See also http://www.unicode.org/standard/where/
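
The sequences above are just a base letter plus U+0308 COMBINING DIAERESIS;
a trivial sketch:

public class CombiningAccents {
    public static void main(String[] args) {
        System.out.println("q\u0308"); // q + U+0308 → q̈
        System.out.println("Q\u0308"); // Q + U+0308 → Q̈
    }
}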

As to the italic, that also depends on the font support on your system.



Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


On Fri, Feb 14, 2014 at 2:37 AM, Крушевљанин pe...@muchomail.com wrote:

 There is still a problem with the letters бгдпт in italic, and б in regular mode.

 OpenType support is still very weak (Firefox, LibreOffice on Linux,
 Adobe's software and that's it, practically). It's also disappointing that
 Microsoft is still incapable of implementing and enforcing this support at
 the system level.

 Also, there are Serbian/Macedonian cyrillic vowels with accents (total: 7
 types × 6 possible letters = 42 combinations) where the majority of them don't
 exist precomposed, and it is impossible to enter them. A lot of today's
 fonts (even commercial) still have issues with accents.

 In Unicode, Latin scripts are always favored, which is simply not fair to
 the rest of the world. They have space to put glyphs for dominoes, a lot of
 dead languages etc. but they don't have space for real-world issues.

 I want Unicode organization to change their politics and pay attention to
 small countries like Serbia and Macedonia. We have real-world problems.
 Thank you.

 If you think these are biases of me, I say — real-world problem for us.
 If you think changes would invalidate existing texts, I say — no, because
 *real* Serbian/Macedonian support still doesn't exist! And we can develop
 converters in the future, so I don't see any huge cost problems...

 --
 Крушевљанин Иван

 _
 The Free Email with so much more!
 = http://www.MuchoMail.com =



Re: CJK IDS database

2014-01-14 Thread Mark Davis
Boy, I'd forgotten about those. There is an open-source collection of IDSs
that I used to create those files. Unfortunately, I found that *that* data
would take a lot of cleanup.

I do agree that it would be very useful to have an open-source repository
of IDSs for Unicode characters, but I don't know of one. Others?


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


On Wed, Jan 15, 2014 at 4:36 AM, Michel Suignard mic...@suignard.com wrote:

  I guess you should ask the owner, our distinguished president.

 Michel



 *From:* Unicode [mailto:unicode-boun...@unicode.org] *On Behalf Of *Andrew
 Pantyukhin
 *Sent:* Tuesday, January 14, 2014 4:06 PM
 *To:* unicode@unicode.org
 *Subject:* CJK IDS database



 Hi!

 I find Ideographic Description Sequences massively useful for studying and
 describing Chinese characters. However, I found only one comprehensive
 source of them — http://macchiato.com/ids/


 Does anyone know where the files come from? Were they part of the IRG
 process, or just an isolated effort? What are the private use characters in
 the sequences?

 I'd like to contribute to the IDS database and incorporate it into
 products like wiktionary and rikaikun.





Language Death

2013-12-05 Thread Mark Davis
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0077056

with a popular article at
http://www.washingtonpost.com/blogs/worldviews/wp/2013/12/04/how-the-internet-is-killing-the-worlds-languages/

The source article was interesting, although I'd take issue with some of
their methodology.

The WP gloss takes some liberties; in particular, the source says “The
latest (2012/02/28) publicly available version of the [SIL] database
distinguishes 7,776 languages” while the WP leaps to the conclusion that
“…at least 7,776 languages are in use in the greater offline world.”

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


Re: Best practice of using regex on identify none-ASCII email address

2013-11-01 Thread Mark Davis
These are two well-known serious flaws in EAI and URLs; there is no useful
syntactic limit on what is in the query part of a URL or on the local part
of an email address that would allow their boundaries to be detected in
plaintext.

No use complaining about them, because people are concerned with backwards
compatibility, and wouldn't change the underlying specs.

That being true, I wish that industry could come to consensus about
requiring everything outside of a well-defined, backwards-compatible set of
characters to be expressed as UTF-8 percent-escaped characters in these
fields when they are expressed as plaintext. (Something like XID_Continue ±
exceptions.) That would allow for unambiguous parsing in plaintext.


Mark https://google.com/+MarkDavis
*

*— Il meglio è l’inimico del bene —*

On Thu, Oct 31, 2013 at 8:37 PM, Philippe Verdy verd...@wanadoo.fr wrote:

 How can it "surprisingly work" if you need to safely embed an
 email address as a URI in a plain-text document? Yes, there's a way to work
 with the IDNA part, but the local part is a challenge that will require
 (to make it work) that the mail server accept several aliased account
 names, depending on the document in which the address was embedded and
 encoded before being dereferenced and used to send mails.

 There's no easy way to embed the local part in plain-text when it can be
 arbitrary sequences of bytes in the non-ASCII range, whose encoding in the
 target domain name is unpredictable without first querying the MX server
 for that domain for this info, or without retrying sending mails with
 several guesses: these guesses with retries may cause privacy issues for
 the legitimate owner of non-ASCII email accounts (another reason for using
 email verification/confirmation of the owner, before sending him private
 messages).

 2013/10/31 Shawn Steele shawn.ste...@microsoft.com

  I think that’s true for non-ASCII non-EAI local parts as well.  It’s
 so inconsistent it’s surprising when it works.





Re: Best practice of using regex on identify none-ASCII email address

2013-11-01 Thread Mark Davis
I'm not saying that what is sent to the server has to be those bytes; I'm
saying that if we use the convention that punctuation, whitespace, etc gets
escaped, it would allow us to recognize the boundaries of the local part in
plain text.

I think what you mention is part of a more general problem. Let's suppose
that I have an email address where the bytes that the server recognizes for
the local part are 61 B3@foo.com. I convert that using Latin-14 to aġ@
foo.com. I send it in an email to you, and you receive it as UTF-8. You see
aġ@foo.com, but underneath the covers it is bytes 61 C4 A1. But then you
send to the server 61 C4 A1@foo.com, and it fails. Or worse yet, reaches
someone whose email is aġ@foo.com. (Ok, I could have poked around and
found a more compelling example, but you see the point).

If I really wanted to be absolutely certain that my email wouldn't be
munged by a conversion, I'd never convert from bytes: we'd never see 
m...@foo.com, we'd always see the equivalent of %6d%61%72...@foo.com.
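
A sketch of that fully-escaped convention (escapeAll is a hypothetical
helper, not any standard API):

import java.nio.charset.StandardCharsets;

public class FullEscape {
    // Percent-escape every UTF-8 byte, so the local part's boundaries are
    // unambiguous in plain text.
    static String escapeAll(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02x", b & 0xFF));
        }
        return sb.toString();
    }
    public static void main(String[] args) {
        System.out.println(escapeAll("mark") + "@foo.com"); // %6d%61%72%6b@foo.com
    }
}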






Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


On Fri, Nov 1, 2013 at 1:36 PM, Philippe Verdy verd...@wanadoo.fr wrote:



 2013/11/1 Mark Davis ☕ m...@macchiato.com

 These are two well-known serious flaws in EAI and URLs; there is no
 useful syntactic limit on what is in the query part of a URL or on the
 local part of an email address that would allow their boundaries to be
 detected in plaintext.

 No use complaining about them, because people are concerned with
 backwards compatibility, and wouldn't change the underlying specs.

 That being true, I wish that industry could come to consensus about
 requiring everything outside of a well-defined, backwards-compatible set of
 characters to be expressed as UTF-8 percent-escaped characters in these
 fields when they are expressed as plaintext. (Something like XID_Continue ±
 exceptions.) That would allow for unambiguous parsing in plaintext.


 Why UTF-8 only? There already exist email accounts created with
 various ISO 8859-* or Windows codepages, or KOI8-R (or -U). And none of these
 addresses are aliased with a UTF-8 encoded account name reaching the same
 mailbox (creating these aliases would help users having such accounts
 to protect their privacy; however, there may exist rare cases where these
 aliases would conflict with distinct mail accounts



Re: full-width Latin missing from confusables data

2013-10-29 Thread Mark Davis
FYI, I just submitted a doc to the UTC for the upcoming meeting:

#36 & #39 Recommendations

http://goo.gl/NKeRVB

If there is any feedback you'd like me to incorporate in a revision before
the meeting, please let me know.

Mark


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*


On Tue, Oct 15, 2013 at 8:53 PM, Mark Davis ☕ m...@macchiato.com wrote:

  but as Michel mentioned the data
 does not seem consistent in that case.
 ​

 You might add that to your report...



 Mark https://plus.google.com/114199149796022210033

  *— Il meglio è l’inimico del bene —*


 On Tue, Oct 15, 2013 at 7:23 PM, Chris Weber ch...@lookout.net wrote:

 On 10/14/2013 12:40 AM, Mark Davis ☕ wrote:
  For the confusables, the presumption is that implementations have
  already either normalized the input to NFKC or have rejected input that
  is not NFKC.

 Thanks for the explanation Mark.  It makes sense for implementations
 which want to detect confusability, but as Michel mentioned the data
 does not seem consistent in that case.  Another case could be
 implementations which want to generate confusable strings for testing -
 do you think those could be improved by having this extra data?  For
 example:

 http://unicode.org/cldr/utility/confusables.jsp?a=m&r=None

  It would probably be worth clarifying this in the text of
  http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an
  upcoming UTC meeting at the start of Nov., so if you want to suggest
  that or any other improvements, you should use the
  http://www.unicode.org/reporting.html.

 Thank you, I'll file a report.

 --
 Best regards,
 Chris Weber - ch...@lookout.net - http://www.lookout.net
 PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7





Re: Terminology question re ASCII

2013-10-28 Thread Mark Davis
Normally the term ASCII just refers to the 7-bit form. What is sometimes
called “8-bit ASCII” is the same as ISO Latin 1. If you want to be
completely clear, you can say “7-bit ASCII”.
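
In code terms, the strict sense is just "every code unit below 0x80"; a
trivial sketch:

public class PlainAscii {
    public static void main(String[] args) {
        System.out.println("plain".chars().allMatch(c -> c < 0x80));      // true
        System.out.println("caf\u00E9".chars().allMatch(c -> c < 0x80));  // false
    }
}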


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*


On Tue, Oct 29, 2013 at 5:12 AM, d...@bisharat.net wrote:

 Quick question on terminology use concerning a legacy encoding:

 If one refers to “plain ASCII”, or “plain ASCII text” or “... characters”,
 should this be taken strictly as referring to the 7-bit basic characters,
 or might it encompass characters that might appear in an 8-bit character
 set (per the so-called extended ASCII)?

 I've always used the term ASCII in the 7-bit, 128 character sense, and
 modifying it with plain seems to reinforce that sense. (Although plain
 text in my understanding actually refers to lack of formatting.)

 Reason for asking is encountering a reference to plain ASCII describing
 text that clearly (by presence of accented characters) would be 8-bit.

 The context is one of many situations where in attaching a document to an
 email, it is advisable to include an unformatted text version of the
 document in the body of the email. Never mind that the latter is probably
 in UTF-8 anyway(?) - the issue here is the terminology.

 TIA for any feedback.

 Don Osborn

 Sent via BlackBerry by ATT





Re: full-width Latin missing from confusables data

2013-10-15 Thread Mark Davis
 but as Michel mentioned the data
does not seem consistent in that case.
​

You might add that to your report...



Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*


On Tue, Oct 15, 2013 at 7:23 PM, Chris Weber ch...@lookout.net wrote:

 On 10/14/2013 12:40 AM, Mark Davis ☕ wrote:
  For the confusables, the presumption is that implementations have
  already either normalized the input to NFKC or have rejected input that
  is not NFKC.

 Thanks for the explanation Mark.  It makes sense for implementations
 which want to detect confusability, but as Michel mentioned the data
 does not seem consistent in that case.  Another case could be
 implementations which want to generate confusable strings for testing -
 do you think those could be improved by having this extra data?  For
 example:

 http://unicode.org/cldr/utility/confusables.jsp?a=m&r=None

  It would probably be worth clarifying this in the text of
  http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an
  upcoming UTC meeting at the start of Nov., so if you want to suggest
  that or any other improvements, you should use the
  http://www.unicode.org/reporting.html.

 Thank you, I'll file a report.

 --
 Best regards,
 Chris Weber - ch...@lookout.net - http://www.lookout.net
 PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7



Re: full-width Latin missing from confusables data

2013-10-14 Thread Mark Davis
For the confusables, the presumption is that implementations have already
either normalized the input to NFKC or have rejected input that is not
NFKC.
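
For example, with the stock JDK normalizer (a sketch):

import java.text.Normalizer;

public class FullwidthNfkc {
    public static void main(String[] args) {
        // U+FF4D FULLWIDTH LATIN SMALL LETTER M folds to "m" under NFKC,
        // so NFKC-normalized input needs no separate fullwidth entry.
        String nfkc = Normalizer.normalize("\uFF4D", Normalizer.Form.NFKC);
        System.out.println(nfkc.equals("m")); // true
    }
}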

More broadly, in gathering data the main emphasis is on characters that fit
the profile in http://www.unicode.org/reports/tr39/#Identifier_Characters,
including scripts like Cyrillic (
http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts). So while
we do add characters outside of that, there has been no concerted effort to
do so.

In particular, in your identifiers you should not allow scripts like
Buginese (
http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Exclusion_from_Identifiers)
or
Lisu (http://www.unicode.org/reports/tr31/#Table_Limited_Use_Scripts)
without recognizing that the confusable data will be sketchy for those.

It would probably be worth clarifying this in the text of
http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an
upcoming UTC meeting at the start of Nov., so if you want to suggest that
or any other improvements, you should use the
http://www.unicode.org/reporting.html.


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*


On Sun, Oct 13, 2013 at 7:36 PM, Chris Weber ch...@lookout.net wrote:

 While looking closer at the current confusables data, I've noticed that
 several of the fullwidth code points seem to be missing from the
 confusables data. For example, U+FF4D FULLWIDTH LATIN SMALL LETTER M
 does not exist as a confusable for U+006D LATIN SMALL LETTER M, as well
 as several others I've noticed.

 Was this intentional?

 Also, I'm not clear on the difference between the confusables.txt and
 confusablesSummary.txt - are these meant to provide the same data in
 different formats?

 --
 Best regards,
 Chris Weber - ch...@lookout.net - http://www.lookout.net
 PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7




Re: More additional Greek (and Hebrew) characters needed for proposal

2013-09-21 Thread Mark Davis
http://www.unicode.org/faq/char_combmark.html#9 and following.


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*


On Sat, Sep 21, 2013 at 7:38 PM, Robert Wheelock rwhlk...@gmail.com wrote:

 Hello again, y’all!

 I’ve got quite a few characters (currently missing) that DO need proposal
 for inclusion!  I typed up a document (for the new Fontboard polytonic
 Greek/Coptic keyboard layouts) that list the Unicode hexadecimal numerical
 values for the polytonic/monotonic Greek precomposed characters, and found
 out that (at least) 17 vowel/accent combos are still missing:

 H-C IOTA and UPSILON with both DIALYTIKA and ACCENTS (8 precomposed
 characters)
 H-C ALPHA, ĒTA, and ŌMEGA with both PROSGEGRAMMENĒ and ACCENTS (9
 precomposed characters).

 Besides those, there’re accented consonants that also need encoding—ZĒTA
 and SIGMA with DIALYTIKA (H-C/L-C), GAMMA with TILDAS, GAMMA; KAPPA; and
 KHI with OVERDOT, KAPPA; PI; TAU with TILDAS, LAMBDA; MU; NU with both
 PSILI and DASEIA, LAMBDA; MU; NU; and RHŌ with UNDERRING, ... .

 As far as Hebrew is concerned, we NEED these new characters encoded:

 WAW with a TRUE SHURUQ (the inner dot positioned a bit higher than a
 DAGHESH or a MAPPIQ)
 The same (above mentioned) WAW-TRUE SHURUQ with a DAGHESH added
 WAW with both a ḪOLAM atop and a DAGHESH inside
 Doubly-pointed SHIN letters—a plain one + one with a DAGHESH added
 MEM SOFITH with a right-positioned ḪIRIQ
 ḪAṬAFOTH vowel points—each with SILLUQ/METHEGH interjected within
 KHAF SOFITH and FEʾ SOFITH with RAFEH (especially for Yiddish)
 GHIMEL; DHALETH; and THAW with RAFEH
 CHIMEL; ĹAMEDH; and ÑUN with VARIQAʾ (especially for Ladino)
 BENT LAMEDH—plain, with ḪOLAM, with DAGHESH, and with both DAGHESH + ḪOLAM
 YUDH-WAW ligature
 GALGAL HAFUKH accent (especially for Yiddish)
 GIMEL; DALETH; ZAYIN; ṬETH; LAMEDH; NUN; SAMEKH; ʿAYIN; and REʾSH with
 GALGAL HAFUKH (for Yiddish palatal consonanats and the /e/ vowel sound)
 An assortment of letters with top dot configurations—single, double
 horizontal, triple up-triangular, and quadruple squared—for the typography
 required for miscellaneous Jewish languages, as these top-dotted letters
 are intended to imitate the ʾIJAM dots in the corresponding Arabic letters
 The Palestinian, Babylonian, and Yemenite systems of vowel pointing and
 cantillation.

 Please find the .PDF document on the polytonic Greek character codepoint
 listings; I’ll need to finish—and publish—a similar publication for Hebrew
 characters.  Thank You!





Re: Code point vs. scalar value

2013-09-20 Thread Mark Davis
Nicely stated.


Mark https://plus.google.com/114199149796022210033

*— Il meglio è l’inimico del bene —*


On Thu, Sep 19, 2013 at 11:21 PM, Whistler, Ken ken.whist...@sap.com wrote:

  Stephan Stiller seems unconvinced by the various attempts to explain the
 situation. Perhaps an authoritative explanation of the textual history
 might assist.


 Stephan demands an answer:


 I want to know why the Glossary claims that surrogate code points are
 [r]eserved for use by UTF-16.


 Reason #1 (historical): Because the Glossary entry for “Surrogate Code
 Point” has been worded thusly since Unicode 4.0 (p. 1377), published in
 2003, and hasn’t been reworded since.


 Reason #2 (substantive): Because UTC members have been satisfied with the
 content of the statement and have not required it be changed in subsequent
 versions of the standard.


 Reason #3 (intentional): Because the wording was added in the first place
 as part of the change to identify the term “surrogate character”, which had
 been widely used before, as a misnomer and a usage to be deprecated. The
 term “surrogate code point” was a deliberate introduction at that time to
 refer specifically to the range U+D800..U+DFFF of “code points” which could
 *not* be used to encode abstract characters.


 Reason #4 (proximal): Because nobody recently has submitted a suggested
 improvement to the text of the relevant entry in the glossary (and
 associated text in Chapter 3) which has passed muster in the editorial
 committee and been considered to be an improvement on the text.


 If it is exegesis rather than textual history that concerns you, here is
 what I consider to be a full explanation of the meaning of the text that
 troubles you so:


 Code points in the range U+D800..U+DFFF are reserved for a special
 purpose, and cannot be used to encode abstract characters (thereby making
 them encoded characters) in the Unicode Standard. Note that it is perfectly
 valid to refer to these as code points and use the U+ prefix for them. The
 U+ prefix identifies the Unicode codespace, and the glossary (correctly)
 identifies that as the range of integers from 0 to 10FFFF (hex). O.k., if the
 range of code points U+D800..U+DFFF is reserved for a special purpose,
 what is that purpose and how do we designate the range? The designation is
 easy: we call elements of the subrange U+D800.. U+DBFF “high-surrogate code
 point” (see D71) and the elements of the subrange U+DC00..U+DFFF
 “low-surrogate code point” (see D73), and by construction (and common
 usage), the elements contained in the union of those two subranges is
 called “surrogate code point”. What is the special purpose? The shorthand
 description of the purpose is that the “surrogate code points” are “used
 for UTF-16”. But since that seems to confuse a minority of the readers of
 the standard, here is a longer explication: The surrogate code points are
 deliberately precluded from use to encode abstract characters to enable the
 construction of an efficient and unambiguous mapping between Unicode scalar
 values (the U+0000..U+D7FF and U+E000..U+10FFFF subranges of the Unicode
 codespace) and the sequences of 16-bit code units defined in the UTF-16
 encoding form. In other words, the reservation *from* encoding for the code
 points U+D800..U+DFFF enables the use of the numerical range 0xD800..0xDFFF
 to define surrogate pairs to map U+10000..U+10FFFF, while otherwise
 retaining a simple one-to-one mapping from code point to code unit in
 UTF-16 for the BMP code points which *are* used for encoding abstract
 characters. In short, the surrogate code points are “used for UTF-16”.
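
 To make the arithmetic concrete, here is a minimal Java sketch of that
 mapping (the class and method names are mine, not the standard's):

 public class Utf16Mapping {
     // U+10000..U+10FFFF become a surrogate pair; every other scalar
     // value maps to a single 16-bit code unit equal to itself.
     static char[] toUtf16(int scalarValue) {
         if (scalarValue < 0x10000) {
             return new char[] { (char) scalarValue };
         }
         int v = scalarValue - 0x10000;        // 20 significant bits
         return new char[] {
             (char) (0xD800 + (v >>> 10)),     // high surrogate: top 10 bits
             (char) (0xDC00 + (v & 0x3FF))     // low surrogate: bottom 10 bits
         };
     }

     public static void main(String[] args) {
         for (char unit : toUtf16(0x10400)) {
             System.out.printf("%04X ", (int) unit);   // prints: D801 DC00
         }
     }
 }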


 Stephan’s next demand for an answer was:


 Remind me real quick, in what way does a function use the input values
 that it's not defined on?


 Well, the problem here is in the formulation of the implied question. I
 suspect, from the discussion in this thread, that Stephan has concluded
 that the generic wording “used for” in the glossary item in question
 necessarily imputes that the surrogate code points are therefore elements of
 the domain of the mapping function for UTF-16 (which maps Unicode scalar
 values to sequences of UTF-16 code units). Of course that imputation is
 incorrect. Surrogate code points are excluded from that domain, by
 *definition*, as intended. And I have explained above what the phrase “used
 for” is actually used for in the glossary entry.


 Finally:


 And what does this have to do with UTF-16?


 It is definitional for UTF-16. I think that should also be clear from the
 explanation above.


 Now, rather than quibbling further about what the glossary says, if the
 explanation still does not satisfy, and if the text in the glossary (and in
 Chapter 3) still seems wrong and misleading in some way, here is a more
 productive way forward:

 

Re: Draft of LDML Specification for CLDR release 24

2013-09-13 Thread Mark Davis
Thanks for the feedback; the typo is fixed.


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Fri, Sep 13, 2013 at 1:19 AM, Philippe Verdy verd...@wanadoo.fr wrote:

 Typo in section 2.3 Number Symbols, for the new item
 superscriptingExponent which describes:
 The superscripting can use markup, such as sup4/sub in HTML, (...)

 Of course this is sup4/sup


 2013/9/13 John Emmons e...@us.ibm.com

 CLDR v24 is scheduled to be released next week (2013-09-18). While the
 LDML specification (http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html)
 and release note (http://cldr.unicode.org/index/downloads/cldr-24)
 are still being worked on, we'd welcome feedback on any major problems in
 the text.

 A summary of the changes to specification can be found at:

http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#Modifications




 Regards,

 John C. Emmons
 Globalization Architect & Unicode CLDR TC Chairman
 IBM Software Group
 Internet: e...@us.ibm.com





Re: polytonic Greek: diacritics above long vowels ᾱ, ῑ, ῡ

2013-08-05 Thread Mark Davis
 Classical Greek might qualify [for a CLDR entry]

It certainly qualifies, but we require that a submitter commit to
collecting a minimal amount of data before we add it. See
http://cldr.unicode.org/index/cldr-spec/minimaldata


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Mon, Aug 5, 2013 at 3:58 PM, Stephan Stiller
stephan.stil...@gmail.comwrote:

  On 8/5/2013 11:26 AM, Whistler, Ken wrote:

 Inclusion of the precomposed characters now seen in the U+1FXX block was part 
 of the price of the merger. What was included was precisely the repertoire 
 requested by Greece, and no attempt was made to further rationalize forms 
 including macrons for Ancient Greek.

  Thanks, Ken. It's good to know that there is no other reason. Partial
 credit goes to Tom Gewecke who had pointed me off-list to
 http://www.tlg.uci.edu/~opoudjis/unicode/ken_adscripts.html
 and the fact that the precomposed set from ISO 10646 can be traced back to
 ELOT (ΕΛΟΤ).

  On 8/5/2013 1:25 PM, Richard Wordingham wrote:

 Classical Greek might qualify [for a CLDR entry]

  Yes or no, and I have in fact no(t yet an) opinion on the necessity
 thereof – it's a different question from the one to what extent D matters
 for A *if* A had an entry, but I think we're on the same page at this
 point:


 On 8/5/2013 1:25 PM, Richard Wordingham wrote:

 However, if vowels with macrons had made it into D, then one would expect 
 them in A.

  Yep, I agree. A loose analogy and one sensible view (which is in fact
 compatible with yours) is that it's imaginable for say a lexicographer for
 English to have some version of Cyrillic letters available for typesetting
 but defensible for him to not have/use stress marks, whereas any Cyrillic
 typesetting engine within a Cyrillic locale should be able to provide them.
 This made-up example is imperfect, but it might help someone understand the
 thread. That said, I have not yet formed an opinion on whether a font
 intended for a Modern Greek locale should be able to render ᾱ, ῑ, ῡ with
 additional diacritics. (One intended for Ancient Greek should, I think.)

 Stephan




Re: Behdad Esfahbod won an O'Reilly Open Source Award!

2013-07-29 Thread Mark Davis
Great news, and well deserved!

Congratulations, Behdad!


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Mon, Jul 29, 2013 at 9:41 PM, Roozbeh Pournader rooz...@google.comwrote:

 Some of you probably have heard the news already, but in case you haven't,
 Behdad won the prestigious O'Reilly Open Source Award, announced last
 Friday.

 Here's the announcement:
 http://www.oscon.com/oscon2013/public/schedule/detail/29956

 Selected quotes:

 The O’Reilly Open Source Awards recognize individual contributors who
 have demonstrated exceptional leadership, creativity, and collaboration in
 the development of Open Source Software. [...]

 *Behdad Esfahbod (HarfBuzz):* Through the HarfBuzz project Behdad is
 working relentlessly to get all languages supported in Free Software
 operating systems, word processors, devices and browsers, no matter how
 complex their scripts are.

 I wish to congratulate Behdad for his achievements, which have really
 helped make open source way more accessible to billions of users around the
 world. I'm eagerly waiting for his amazing magic and superhacker skills to
 bear even more fruits over the years to come. I'm proud to have been able
 to call him a friend, colleague, and collaborator for more than fifteen
 years now.

 Roozbeh



Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Mark Davis
Popping up a level.

ICU (and some other libraries) have heuristic encoding detection, that will
take a sequence of bytes and come up with a likely encoding id.
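
A minimal sketch of what that looks like with ICU4J's CharsetDetector
(assuming the icu4j jar is on the classpath; the detector works best when
given a reasonable amount of input text, so a two-byte sample is only for
illustration):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectionExample {
    public static void main(String[] args) {
        byte[] bytes = { (byte) 0xC3, (byte) 0xB1 };  // e.g. UTF-8 for U+00F1 ñ
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        CharsetMatch best = detector.detect();        // a likely guess, not a certainty
        System.out.println(best.getName()
                + " (confidence " + best.getConfidence() + "/100)");
    }
}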


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Fri, Jul 19, 2013 at 8:40 PM, Whistler, Ken ken.whist...@sap.com wrote:



  Suppose that these hex bytes:
 
C3 83 C2 B1
 
  show up in a message and the message contains no hint what its encoding
 is.
 
  Perhaps it is 8859-1, in which case the message consists of four 1-byte
  characters:
 
  C3 = Ã
  83 = the “no break here” character
  C2 = Â
  B1 = ±
 
  Perhaps it is UTF-8, in which case the message consists of two 2-byte
  characters:
 
  C383 = 쎃
  C2B1 = 슱

 Actually, that would be interpreting it as UTF-16, not as UTF-8. That
 can probably be quickly ruled out if the rest of the text is not obviously
 in UTF-16.

 Interpreted as UTF-8, it would be:

 C3 83 -- U+00C3 = Ã
 C2 B1 -- U+00B1 = ±

 More likely than the other two alternatives you cite.

 Of course, you also have to consider serial corruptions as a possibility.

 It could have started out as UTF-8 C3 B1 -- U+00F1 = ñ.

 Then the C3 B1 got misinterpreted as Latin-1, and then re-misinterpreted
 as UTF-8 again.

 --Ken
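
 The round trip Ken describes is easy to reproduce; a minimal Java sketch
 of that double misinterpretation (class name mine):

 import java.nio.charset.StandardCharsets;

 public class MojibakeExample {
     public static void main(String[] args) {
         byte[] utf8 = "ñ".getBytes(StandardCharsets.UTF_8);      // C3 B1
         // Step 1: misread the UTF-8 bytes as Latin-1, yielding "Ã±"
         String misread = new String(utf8, StandardCharsets.ISO_8859_1);
         // Step 2: re-encode the misreading as UTF-8
         for (byte b : misread.getBytes(StandardCharsets.UTF_8)) {
             System.out.printf("%02X ", b & 0xFF);                // C3 83 C2 B1
         }
     }
 }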






Re: The skywriter we hired has terrible Unicode support

2013-05-08 Thread Mark Davis
Saw that, thanks!


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Wed, May 8, 2013 at 8:26 PM, Tim Greenwood timo...@greenwood.namewrote:

 http://xkcd.com/1209/



RE: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-20 Thread Mark Davis
LOL...

{phone}
On Apr 20, 2013 8:44 PM, Erkki I Kolehmainen e...@iki.fi wrote:

 Mr. Overington,

 I'm sorry to have to admit that I cannot follow at all your train of
 thought on what would be the practical value of localizable sentences in
 any of the forms that you are contemplating. In my mind, they would not
 appear to broaden the understanding between different cultures (and
 languages), quite the contrary. I appreciate the fact that there are
 several respectable members of this community who are far too polite to
 state bluntly what they think of the technical merits of your proposal.

 Sincerely, Erkki I. Kolehmainen

 -Original Message-
 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org]
 On Behalf Of William_J_G Overington
 Sent: 20 April 2013 12:39
 To: KenWhistler
 Cc: unicode@unicode.org; KenWhistler; wjgo_10...@btinternet.com
 Subject: Re: Encoding localizable sentences (was: RE: UTC Document Register
 Now Public)

 On Friday 19 April 2013, Whistler, Ken ken.whist...@sap.com wrote:

  You are aware of Google Translate, for example, right?

 Yes. I use it from time to time, mostly to translate into English: it is
 very helpful.

  If you input sentences such as those in your scenarios or the other
 examples, such as:

  Where can I buy a vegetarian meal with no gluten-containing ingredients
 in it please?

  You can get immediately serviceable and understandable translations in
 dozens of languages. For example:

  Wo kann ich ein vegetarisches Essen ohne Gluten-haltigen Bestandteile
 davon, bitte?

  Not perfect, perhaps, but perfectly comprehensible. And the application
 will even do a very decent job of text to speech for you.

 I am not a linguist and I know literally almost no German, so I am not
 able to assess the translation quality of sentences. Perhaps someone on
 this list who is a native speaker of German might comment please.

 I am thinking that the fact that I am not a linguist and that I am
 implicitly seeking the precision of mathematics and seeking provenance of a
 translation is perhaps the explanation of why I am thinking that
 localizable sentences is the way forward. There seems to a fundamental
 mismatch deep in human culture of the way that mathematics works precisely
 yet that translation often conveys an impression of meaning that is not
 congruently exact. Perhaps that is a factor in all of this.

 Thank you for your reply and for taking the time to look through the
 simulations and for commenting.

 Having read what you have written and having thought about it for a while
 I am wondering whether it would be a good idea for there to be a list of
 numbered preset sentences that are an international standard and then if
 Google chose to front end Google Translate with precise translations of
 that list of sentences made by professional linguists who are native
 speakers, then there could be a system that can produce a translation that
 is precise for the sentences that are on the list and machine translated
 for everything else.

 Maybe there could then just be two special Unicode characters, one to
 indicate that the number of a preset sentence is to follow and one to
 indicate that the number has finished.

 In that way, text and localizable sentences could still be intermixed in a
 plain text message. For me, the concept of being able to mix text and
 localizable sentences in a plain text message is important. Having two
 special characters of international standard provenance for denoting a
 localizable sentence markup bubble unambiguously in a plain text document
 could provide an exact platform. If a software package that can handle
 automated localization were active then it could replace the sequence with
 the text of the sentence localized into the local language: otherwise the
 open localizable sentence bubble symbol, some digits and the close
 localizable sentence bubble symbol would be displayed.

 If that were the case then there might well not be symbols for the
 sentences, yet the precise conveying of messages as envisaged in the
 simulations would still be achievable.

 Perhaps that is the way forward for some aspects of communication through
 the language barrier.

 Another possibility would be to have just a few localizable sentences with
 symbols as individual characters and to have quite a lot of numbered
 sentences using a localizable sentence markup bubble and then everything
 else by machine translation.

 I shall try to think some more about this.

  At any rate, if Margaret Gattenford and her niece are still stuck at
 their hotel and the snow is blocking the railway line, my suggestion would
 be that Margaret whip out her mobile phone. And if she doesn't have one,
 perhaps her niece will lend hers to Margaret.

 Well, they were still staying at the hotel some time ago.

 They feature in locse027_simulation_five.pdf available from the following
 post.

 

Re: Rendering Raised FULL STOP between Digits

2013-03-10 Thread Mark Davis
Should the Unicode Consortium decide to recommend an existing (or new)
character as a raised decimal for numbers, we would add that to CLDR, and
recommend that implementations accept either one as equivalent when parsing.


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Sun, Mar 10, 2013 at 10:39 AM, Richard Wordingham 
richard.wording...@ntlworld.com wrote:

 On Sat, 9 Mar 2013 18:58:45 -0700
 Doug Ewell d...@ewellic.org wrote:

  Richard Wordingham wrote:

   The general feeling seems to be that computers don't do proper
   decimal points, and so the raised decimal point is dropping out of
   use.

  Any discussion of whether computers handle decimal points properly
  can't happen without talking about number-to-string conversion
  routines in programming languages and frameworks.

 The question is what users will demand. Expectations have been low
 enough that the loss of decimal points has been accepted.
 Additionally, striving for an apparently hard to get raised decimal
 point risks being forced to use an achievable decimal comma.

  Conversion routines are often able to choose between full stop and
  comma as the decimal separator, based on locale, but I'm not aware of
  any that will use U+00B7.

  The same is true for using U+2212, or even U+2013, as the negative
  sign instead of U+002D, which looks just terrible for this purpose in
  many fonts.

 U+2212 is not necessary for English (see CLDR exemplar characters), so
 CLDR policy (if not rules) does not allow it in CLDR conversion rules.
 I'm feeling lucky that I've got away with using it in documents for a
 few years now, but maybe I've only succeeded because we've been cutting and
 pasting from a Unicode-aware environment (Windows) to an 8-bit
 environment (ill-maintained Solaris, hated by management).

 Richard.
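
 On Doug's point that no conversion routines use U+00B7: the stock Java
 formatter can at least be configured to, by hand. A small sketch (the
 locale and pattern choices are mine):

 import java.text.DecimalFormat;
 import java.text.DecimalFormatSymbols;
 import java.util.Locale;

 public class MiddleDotDecimal {
     public static void main(String[] args) {
         DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.UK);
         symbols.setDecimalSeparator('\u00B7');       // U+00B7 MIDDLE DOT
         DecimalFormat format = new DecimalFormat("#,##0.###", symbols);
         System.out.println(format.format(3.14159));  // prints: 3·142
     }
 }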




Re: JSON version of CLDR

2013-03-03 Thread Mark Davis
I think just the main data is converted. If you want to request the other
data you can file a cldr ticket.


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Sat, Mar 2, 2013 at 8:35 PM, Edwin Hoogerbeets ehoogerbe...@gmail.comwrote:

 Hi all, I am trying to find the CLDR collation tailoring and DUCET data
 in JSON format. I looked at the CLDR data published for release 22.1 (
 http://www.unicode.org/repos/cldr-aux/json/22.1/), but it doesn't seem
 to be there. Is this the right place to look for that? (Is it even
 converted to JSON format yet?)

 Thanks,

 Edwin






Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis
 But still non-conformant.

That's incorrect.

The point I was making above is that in order to say that something is
non-conformant, you have to be very clear what it is non-conformant *TO*
.

 Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such
 (e.g., in collation).

   - That *is* conformant for *Unicode 16-bit strings.*
   - That is *not* conformant for *UTF-16*.

There is an important difference.

Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Mon, Jan 7, 2013 at 10:48 AM, Doug Ewell d...@ewellic.org wrote:

 But still non-conformant.


Re: Are there Unicode processors?

2013-01-07 Thread Mark Davis
That is not the typical way that Unicode text is processed.

Typically whatever OS you are using will supply mechanisms for iterating
through any Unicode string, returning each of the code points. It may also
offer APIs for returning information about each character (called 'property
values', or you can get libraries like ICU (http://site.icu-project.org/)
that have full-featured property support (
http://userguide.icu-project.org/strings/properties).
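
In Java, for example, code point iteration plus a per-character property
lookup might look like this minimal sketch (the sample string is mine):

public class CodePointWalk {
    public static void main(String[] args) {
        String text = "To\uD835\uDC9C";   // T, o, U+1D49C MATHEMATICAL SCRIPT CAPITAL A
        text.codePoints().forEach(cp ->
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp)));
    }
}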


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Mon, Jan 7, 2013 at 2:34 PM, Costello, Roger L. coste...@mitre.orgwrote:

 Hi Folks,

 An XML processor breaks up an XML  document into its parts -- here's a
 start tag, here's element content, here's an end tag, etc. -- and then
 makes those parts (along with information about each part such as this
 part is a start tag and this part is element content) available to XML
 applications via an API.

 Are there Unicode processors?

 That is, are there processors that break up Unicode text into its parts --
 here's a character, here's another character, here's still another
 character, etc. -- and then makes those parts (along with information about
 each part such as this part is the Latin Capital Letter T and this part
 is the Latin Small Letter o) available to Unicode applications (such as
 XML processors) via an API?

 I did a Google search for Unicode processor and came up empty so I am
 guessing the answer is that there are no Unicode processors. Or perhaps
 they go by a different name? If there are no Unicode processors, why not?

 /Roger





Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis
That's not the point (see successive messages).


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Mon, Jan 7, 2013 at 4:59 PM, Martin J. Dürst due...@it.aoyama.ac.jpwrote:

 On 2013/01/08 3:27, Markus Scherer wrote:

  Also, we commonly read code points from 16-bit Unicode strings, and
 unpaired surrogates are returned as themselves and treated as such (e.g.,
 in collation). That would not be well-formed UTF-16, but it's generally
 harmless in text processing.


 Things like this are called garbage in, garbage out (GIGO). It may be
 harmless, or it may hurt you later.

 Regards,   Martin.




Re: What does it mean to not be a valid string in Unicode?

2013-01-07 Thread Mark Davis
In practice and by design, treating isolated surrogates the same as
reserved code points in processing, and then cleaning up on conversion to
UTFs works just fine. It is a tradeoff that is up to the implementation.

It has nothing to do with a legacy of C pointer arithmetic. It does
represent a pragmatic choice some time ago, but there is no need getting
worked up about it. Human scripts and their representation on computers is
quite complex enough; in the grand scheme of things the handling of
surrogates in implementations pales in significance.


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Mon, Jan 7, 2013 at 9:43 PM, Stephan Stiller
stephan.stil...@gmail.comwrote:


  Things like this are called garbage in, garbage out (GIGO). It may be
 harmless, or it may hurt you later.

 So in this kind of a case, what we are actually dealing with is: garbage
 in, principled, correct results out. ;-)


 Wouldn't the clean way be to ensure valid strings (only) when they're
 built and then make sure that string algorithms (only) preserve
 well-formedness of input?

 Perhaps this is how the system grew, but it seems to me that it's yet
 another legacy of C pointer arithmetic and about convenience of
 implementation rather than a safety or performance issue.

 Stephan





Re: What does it mean to not be a valid string in Unicode?

2013-01-06 Thread Mark Davis
Some of this is simply historical: had Unicode been designed from the start
with 8 and 16 bit forms in mind, some of this could be avoided. But that is
water long under the bridge. Here is a simple example of why we have both
UTFs and Unicode Strings.

Java uses Unicode 16-bit Strings. The following code is copying all the
code units from string to buffer.

StringBuilder buffer = new StringBuilder();
for (int i = 0; i < string.length(); ++i) {
  buffer.append(string.charAt(i)); // copies one UTF-16 code unit at a time
}

If Java always enforced well-formedness of strings, then

   1. The above code would break, since there is an intermediate step where
   buffer is ill-formed (when just the first of a surrogate pair has been
   copied).
   2. It would involve extra checks in all of the low-level string code,
   with some impact on performance.

Newer implementations of strings, such as Python's, can avoid these issues
because they use a Uniform Model, always dealing in code points. For more
information, see also
http://macchiati.blogspot.com/2012/07/unicode-string-models-many-programming.html

(There are many, many discussions of this in the Unicode email archives if
you have more questions.)


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Sat, Jan 5, 2013 at 11:14 PM, Stephan Stiller
stephan.stil...@gmail.comwrote:


 If for example I sit on a committee that devises a new encoding form, I
 would need to be concerned with the question which *sequences of Unicode
 code points* are sound. If this is the same as sequences of Unicode
 scalar values, I would need to exclude surrogates, if I read the standard
 correctly (this wasn't obvious to me on first inspection btw). If for
 example I sit on a committee that designs an optimized compression
 algorithm for Unicode strings (yep, I do know about SCSU), I might want to
 first convert them to some canonical internal form (say, my array of
 non-negative integers). If U+surrogate values can be assumed to not
 exist, there are 2048 fewer values a code point can assume; that's good for
 compression, and I'll subtract 2048 from those large scalar values in a
 first step. Etc etc. So I do think there are a number of very general use
 cases where this question arises.
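
 For instance, the subtraction described above amounts to something like
 this sketch (class and method names mine):

 public class DenseScalars {
     // Scalar values skip U+D800..U+DFFF, so values at or above U+E000
     // can be shifted down by 2048 (0x800) to form a contiguous range
     // 0..0x10F7FF. Inputs are assumed to be scalar values, not surrogates.
     static int toDense(int scalar) {
         return scalar < 0xD800 ? scalar : scalar - 0x800;
     }

     static int fromDense(int dense) {
         return dense < 0xD800 ? dense : dense + 0x800;
     }

     public static void main(String[] args) {
         System.out.println(Integer.toHexString(toDense(0xE000)));     // d800
         System.out.println(Integer.toHexString(fromDense(0x10F7FF))); // 10ffff
     }
 }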


 In fact, these questions have arisen in the past and have found answers
 then. A present-day use case is if I author a programming language and need
 to decide which values for val I accept in a statement like this:
 someEncodingFormIndependentUnicodeStringType str = val, specified in
 some PL-specific way

 I've looked at the Standard, and I must admit I'm a bit perplexed. Because
 of C1, which explicitly states

 A process shall not interpret a high-surrogate code point or a
 low-surrogate code point as an abstract character.

 I do not know why surrogate values are defined as code points in the
 first place. It seems to me that surrogates are (or should be) an encoding
 form–specific notion, whereas I have always thought of code points as
 encoding form–independent. Turns out this was wrong. I have always been
 thinking that code point conceptually meant Unicode scalar value, which
 is explicitly forbidden to have a surrogate value. Is this only
 terminological confusion? I would like to ask: Why do we need the notion of
 a surrogate code point; why isn't the notion of surrogate code units [in
 some specific encoding form] enough? Conceptually surrogate values are
 byte sequences used in encoding forms (modulo endianness). Why would one
 define an expression (Unicode code point) that conceptually lumps
 Unicode scalar value (an encoding form–independent notion) and surrogate
 code point (a notion that I wouldn't expect to exist outside of specific
 encoding forms) together?

 An encoding form maps only Unicode scalar values (that is all Unicode code
 points excluding the surrogate code points), by definition. D80 and what
 follows (Unicode string and Unicode X-bit string) exist, as I
 understand it, *only* in order for us to be able to have terminology for
 discussing ill-formed code unit sequences in the various encoding forms;
 but all of this talk seems to me to be encoding form–dependent.

 I think the answer to the question I had in mind is that the legal
 sequences of Unicode scalar values are (by definition)
 ({U+0000, ..., U+10FFFF} \ {U+D800, ..., U+DFFF})* .
 But then there is the notion of Unicode string, which is conceptually
 different, by definition. Maybe this is a terminological issue only. But is
 there an expression in the Standard that is defined as sequence of Unicode
 scalar values, a notion that seems to me to be conceptually important? I
 can see that the Standard defines the various well-formed encoding form
 code unit sequence. Have I overlooked something?

 Why is it even possible to store a surrogate value in something like the
 icu::UnicodeString datatype? In other words, why are we concerned with
 storing Unicode *code points* in data structures instead 

Re: If X sorts before Y, then XZ sorts before YZ ... example of where that's not true?

2013-01-06 Thread Mark Davis
There are many cases of such digraphs.

Example from Slovak:

c < d < h
but
cd < h < ch

Cf http://www.unicode.org/reports/tr10/, searching for Slovak.
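
A quick sketch of the effect with ICU4J (assuming the Slovak locale data is
available):

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

public class SlovakDigraph {
    public static void main(String[] args) {
        Collator sk = Collator.getInstance(new ULocale("sk"));
        System.out.println(sk.compare("c", "h") < 0);   // true:  c  < h
        System.out.println(sk.compare("cd", "h") < 0);  // true:  cd < h
        System.out.println(sk.compare("ch", "h") < 0);  // false: the digraph ch sorts after h
    }
}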


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Sun, Jan 6, 2013 at 1:56 PM, Costello, Roger L. coste...@mitre.orgwrote:

 Hi Folks,

 In the book, Unicode Demystified (p. xxii) it says:

 An English-speaking  programmer might assume,
 for example, that given the three characters X, Y,
 and Z, that if X sorts before Y, then XZ sorts before
 YZ. This works for English, but fails for many
 languages.

 Would you give an example of where character 1 sorts before character 2
 but character 1, character 3 does not sort before character 2, character 3?

 /Roger





Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Mark Davis
http://www.unicode.org/alloc/CurrentAllocaiton.html
=
http://www.unicode.org/alloc/CurrentAllocation.html


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Fri, Jan 4, 2013 at 10:24 AM, Whistler, Ken ken.whist...@sap.com wrote:

 Stephan Stiller continued:

  Occasionally the question is asked how many characters Unicode has. This
  question has an answer in section D.1 of the Unicode Standard. I
  suspect, however, that once in a while the motivation for asking this
  question is to find out how much of Unicode has been used up. As
  filling in holes would be dispreferred, it might be interesting to know
  how much of Unicode has been filled if one counts partially filled
  blocks as full. I have no reason to disagree with the (stated and
  reiterated) opinion that our codespace won't be used up in the
  foreseeable future, but it's simply a fun question to ask.
 

 The editors maintain some statistical information relevant to this fun
 question at:

 http://www.unicode.org/alloc/CurrentAllocaiton.html

 Feel free to reference those fun facts the next time Unicode comes up in
 conversation at the bar. ;-)

 There have been a few notable examples where particularly egregious
 examples of holes in blocks that seemed unlikely to be filled with like
 material in the future were “reprogrammed”, as it were, and grabbed for the
 encoding of unrelated sets of characters. The most notable of these is the
 range U+FDD0..U+FDEF in the middle of the Arabic Presentation Forms-A
 block. There was a clear consensus in both committees that nobody wanted to
 add any more encodings for presentation forms of Arabic ligatures. So, when
 a need arose to add another range of noncharacters, the UTC simply decided
 that the otherwise unused range U+FDD0..U+FDEF could serve for that, while
 not requiring the addition of a new two-column block that could otherwise
 be used on the BMP. There are several symbol blocks on the BMP which have
 also had a somewhat colorful and creative history of hole-filling over
 time.

 --Ken






Re: What does it mean to not be a valid string in Unicode?

2013-01-04 Thread Mark Davis
To assess whether a string is invalid, it all depends on what the string is
supposed to be.

1. As Ken says, if a string is supposed to be in a given encoding form
(UTF), but it consists of an ill-formed sequence of code units for that
encoding form, it would be invalid. So an isolated surrogate (eg 0xD800) in
UTF-16 or any surrogate (eg 0xD800) in UTF-32 would make the string
invalid. For example, a Java String may be an invalid UTF-16 string. See
http://www.unicode.org/glossary/#unicode_encoding_form

2. However, a Unicode X-bit string does not have the same restrictions:
it may contain sequences that would be ill-formed in the corresponding UTF-X
encoding form. So a Java String is always a valid Unicode 16-bit string.
See http://www.unicode.org/glossary/#unicode_string

3. Noncharacters are also valid in interchange, depending on the sense of
interchange. The TUS says In effect, noncharacters can be thought of as
application-internal private-use code points. If I couldn't interchange
them ever, even internal to my application, or between different modules
that compose my application, they'd be pointless. They are, however,
strongly discouraged in *public* interchange. The glossary entry and some
of the standard text are a bit old here and need to be clarified.

4. The quotation we select a substring that begins with a combining
character, this new string will not be a valid string in Unicode. is
wrong. It *is* a valid Unicode string. It isn't particularly useful in
isolation, but it is valid. For some *specific purpose*, any particular
string might be invalid. For example, the string mark#d might be invalid in
some systems as a password, where # is disallowed, or where passwords might
be required to be 8 characters long.
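
Point 2 is easy to see in Java; a minimal sketch (the replacement behavior
shown is the JDK's default for content that cannot be encoded):

import java.nio.charset.StandardCharsets;

public class LoneSurrogate {
    public static void main(String[] args) {
        String s = "a\uD800b";            // contains an unpaired high surrogate
        System.out.println(s.length());   // 3: a perfectly valid Unicode 16-bit string
        // Converting to the UTF-8 encoding form must clean it up: the JDK
        // substitutes a replacement for the unpaired surrogate.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));  // a?b
    }
}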




Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Fri, Jan 4, 2013 at 3:10 PM, Stephan Stiller
stephan.stil...@gmail.comwrote:


  A Unicode string in UTF-8 encoding form could be ill-formed if the bytes
 don't follow the specification for UTF-8, for example.

 Given that answer, add in UTF-32 to my email just now, for simplicity's
 sake. Or let's simply assume we're dealing with some sort of sequence of
 abstract integers from hex+0 to hex+10, to abstract away from encoding
 form issues.

 Stephan





Re: locale-aware string comparisons

2013-01-02 Thread Mark Davis
Agreed.

FYI, for those interested, here is the data file I generated with the
approaches A, B, C as discussed.

https://docs.google.com/a/google.com/spreadsheet/pub?key=0AqRLrRqNEKv-dGk0RHVoQWN6OGw1TVFNOWRaMEJfWEE&gid=0


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Wed, Jan 2, 2013 at 11:07 AM, Shawn Steele shawn.ste...@microsoft.comwrote:

 I'd try to avoid making a dependency where case mapping needs to be the
 same as case insensitive comparisons.

 I'd either always case fold then compare, or always compare case
 insensitive.

 -Shawn

 -Original Message-
 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On
 Behalf Of James Cloos
 Sent: Tuesday, January 1, 2013 5:43 PM
 To: Mark Davis ☕
 Cc: Whistler, Ken; unicode@unicode.org
 Subject: Re: locale-aware string comparisons

  MD == Mark Davis ☕ m...@macchiato.com writes:

 MD All of these are different, all of them still have over 200
 MD differences from either compare(lower(x),lower(y)) or compare(upper
 MD (x),upper(y))

 What about, then:

   compare(lower(x),lower(y)) || compare(upper(x),upper(y))

 Or, to emphasize that I mentioned C only as a pseudocode, akin to SQL:

   LOWER(x) LIKE LOWER(y) OR UPPER(x) LIKE UPPER(y)

 Would that cover all of the outliers?

 -JimC
 --
 James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6






Re: locale-aware string comparisons

2013-01-01 Thread Mark Davis
 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR

James,
Even without locale differences, the situation is a bit tricky. Assuming
that str_tolower() and str_toupper() were straightforwardly defined in
terms of the (full) Unicode case mappings, there is still the issue that
the DUCET does not define a caseless compare. It puts case together with
other variants into a set of Level 3 data. There are 3 approaches one can
take with a strcasecmp() straightforwardly based on LDML. I generated some
numbers for these with a quick test program, but note that they use the
CLDR root locale, which has a few changes from DUCET.

A. Define it to be just comparing after Unicode case folding.

B. Use DUCET and only compare according to Level 1 & 2. That ignores case,
but also some other features.

C. Use the case level as defined in LDML, plus Levels 1 & 2.

All of these are different, all of them still have over 200 differences
from either compare(lower(x),lower(y)) or compare(upper(x),upper(y)). These
are mostly because of the special weighting of compatibility variants, or of the
Greek iota subscript. Example:

s < ſ, but upper( s ) = upper( ſ ) // LATIN SMALL LETTER S vs LATIN SMALL
LETTER LONG S
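
For the curious, approach B above can be tried directly in ICU4J; a sketch,
assuming the CLDR root collation data:

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

public class CaseLevels {
    public static void main(String[] args) {
        String s = "s", longS = "\u017F";           // ſ LATIN SMALL LETTER LONG S
        Collator root = Collator.getInstance(ULocale.ROOT);
        System.out.println(root.compare(s, longS)); // negative: tertiary difference
        root.setStrength(Collator.SECONDARY);       // compare on levels 1 & 2 only
        System.out.println(root.compare(s, longS)); // 0: the difference is ignored
    }
}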




Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Mon, Dec 31, 2012 at 3:29 PM, Whistler, Ken ken.whist...@sap.com wrote:

 Well, in answering the question which was actually posed here:

 1. ISO/IEC 10646 has absolutely nothing to say about this issue, because
 10646 does not define case mapping at all.

 2. The Unicode Standard *does* define case mapping, of course, as well as
 case folding. The relevant details are in Section 3.13 of the standard,
 supported by various data files in the Unicode Character Database. TUS 6.2,
 Section 3.13, p. 117, does define toUpperCase(X) and toLowerCase(X), but
 those are string mapping operations, not directly comparable to Linux (and
 in general Unix) toupper() and tolower(), which are character mapping
 functions. The closer correlates to Linux toupper() and tolower() are
 Unicode's definitions of Uppercase_Mapping(C) and Lowercase_Mapping(C).
 However, there is a significant difference lurking, in that the Unicode
 case mapping definitions are not locale-sensitive. The full case mappings
 do include two conditional sets of mappings (from SpecialCasing.txt) for
 Lithuanian and for Turkish and Azeri, mostly affecting the behavior of the
 dot on i, but the use of those conditional mappings depends on the
 availability of explicit language context.

 This contrasts with the Linux (and in general Unix) toupper() and
 tolower() functions, which in principle, at least, are locale-sensitive,
 depending on the current locale setting, and in particular on whether the
 LC_CTYPE category in the locale has a non-null list of mappings for toupper
 and/or tolower in it.

 Perhaps even more importantly, the Unicode Standard does not state
 anything regarding the details of the behavior of the APIs strcasecmp() or
 tolower() or toupper() in libc. Those are the concerns of the C and POSIX
 specs, not the Unicode Standard. Nor could the Unicode Standard really get
 involved in this, precisely because  that behavior involves locales, and
 locales are outside the scope of the Unicode Standard.

 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR may
 have to jump in here, but while locales clearly *are* in the scope of LDML
 and CLDR, there is currently little if anything they have to say about
 specific case mapping rules.

 As regards the particulars of the question, I suspect that it would depend
 in part on how strcasecmp(), str_tolower() and str_toupper() are
 implemented (I am assuming string conversion APIs here based on the
 tolower() and toupper() APIs), but there probably *are* instances where the
 results would diverge. The most likely source of trouble would be Turkish
 case mapping. In particular, if you compare U+0130 LATIN CAPITAL LETTER I
 WITH DOT ABOVE to a canonically equivalent sequence of U+0049, U+0307,
 there may be conundrums. If strcasecmp() is implemented based on Turkish
 case folding, then strcasecmp( U+0130, U+0049, U+0307 ) == 0. If
 str_tolower() is based on Turkish case mapping, then str_tolower( U+0130 )
 == U+0069, U+0307, so strcmp(str_tolower( U+0130), str_ tolower(
 U+0049,U+0307 ) == 0, *but* str_toupper( U+0130 ) == U+0130 and
 str_toupper( U+0049,U+0307 ) == U+0049,U+0307, so strcmp(str_toupper(
 U+0130 ), str_toupper( U+0049,U+0307 ) != 0. The two uppercased versions
 are *canonically* equivalent, but you wouldn't expect a
 strcmp() operation to be checking normalization of strings. So unless the
 implementations of str_tolower() and str_ toupper() were doing canonical
 normalization as well as case mapping, you could indeed find some odd edge
 cases for Turkish casing, at least.

 --Ken
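
 The conundrum is easy to demonstrate in Java, which implements the Turkish
 conditional case mappings from SpecialCasing.txt. A sketch (character
 choices mine):

 import java.util.Locale;

 public class TurkishEdgeCase {
     public static void main(String[] args) {
         Locale turkish = Locale.forLanguageTag("tr");
         String precomposed = "\u0130";    // İ LATIN CAPITAL LETTER I WITH DOT ABOVE
         String decomposed  = "I\u0307";   // canonically equivalent: I + COMBINING DOT ABOVE
         // Lowercasing merges the two forms: both become "i" under Turkish rules.
         System.out.println(precomposed.toLowerCase(turkish)
                 .equals(decomposed.toLowerCase(turkish)));   // true
         // Uppercasing leaves them as distinct (but canonically equivalent) sequences.
         System.out.println(precomposed.toUpperCase(turkish)
                 .equals(decomposed.toUpperCase(turkish)));   // false
     }
 }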

  Given (just) the data in 10646, Unicode and cldr, are there any locales
  where a 

Re: Character name translations

2012-12-20 Thread Mark Davis
There are different use cases, and I think they are getting confused.

1. Present a name for each character, some sort of formal name.
I think this is probably the least useful for average users.

2. Allow searching for characters, eg in a character picker.
Sample use case: search for dash (or the equivalent in Georgian) and get
the dashes.

3. Provide disambiguating information about a character (to distinguish
from visually similar characters).
Sample use case: have a hover over a character and show em dash vs en
dash  (or the equivalent in Georgian).

Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


On Thu, Dec 20, 2012 at 8:18 AM, Asmus Freytag asm...@ix.netcom.com wrote:

 In my other message, I made clear that I think translations of just the
 names is a lot less useful than translation of the full information
 presented in the code charts, which includes block (and therefore script)
 names, annotations and listing of alternate names by which these characters
 are known to ordinary users.



Some much-needed improvements in JavaScript i18n

2012-12-19 Thread Mark Davis
I have a new google blog post about the new ECMAScript (JavaScript)
internationalization spec.

“Until now, it has been very difficult for web application designers to do
something as simple as sort names correctly according to the user's
language. And it matters: English readers wouldn’t expect Århus to sort
below Zürich, but Danish speakers would.” …

http://googledevelopers.blogspot.com/2012/12/putting-zurich-before-arhus.html

Many people contributed to this multi-year effort!

Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*


Re: Question about normalization tests

2012-12-10 Thread Mark Davis
0300 *is* blocked, because there is a preceding character (0305) that has
the same combining class (230).
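
A quick way to see this with java.text.Normalizer (character choices mine):

import java.text.Normalizer;

public class BlockedCombiningMark {
    public static void main(String[] args) {
        // a + U+0305 COMBINING OVERLINE + U+0300 COMBINING GRAVE (both ccc 230)
        String s = "a\u0305\u0300";
        // U+0300 is blocked by U+0305, so NFC cannot form U+00E0 here:
        System.out.println(Normalizer.normalize(s, Normalizer.Form.NFC).equals(s)); // true
        // Without the overline in between, composition does happen:
        System.out.println(Normalizer.normalize("a\u0300", Normalizer.Form.NFC));   // à
    }
}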

Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



On Mon, Dec 10, 2012 at 11:55 AM, Edwin Hoogerbeets
ehoogerbe...@gmail.comwrote:

 Looking at 0300, it is also not blocked from 0061, so check the primary
 composition for 0061 0300. There is a primary composition for that
 sequence, 00E0, so replace the starter with that, delete the 0300, and
 continue. The string looks like this now:



Re: io9 describes Unicode as one of the 10 most unlikely things influenced by J.R.R. Tolkien

2012-12-08 Thread Mark Davis
 Their inference, it appears, is that had I not read Tolkien when I was 13
I would not be who I am today and the content of the Universal Character
Set might be a lot different than it is.

I doubt it.

Many people are far more responsible for the structure, model, properties,
and characters of Unicode, including not only those who belong to the
Unicode consortium, but also those in the IRG, those in ISO, and those who
originally developed the international, national, and vendor encoding
standards that Unicode built upon.

Unicode characters, measured by frequency of usage on the web, would be
essentially the same had Michael not been around. That would not be the
case without people like Ken Whistler, Joe Becker, Lee Collins, Lisa Moore,
Michel Suignard, or Asmus Freytag: I could go on, but there are far too many
to name. Nor would Unicode have been a success without the many people who
worked in different companies to build the infrastructure necessary for its
use, or the staff behind the scenes working in the Unicode Consortium.

Michael has made many valuable contributions to Unicode, especially for
minority and historic scripts. And he can be rightfully proud of the work
he has done there. But neither should that work be exaggerated.


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



On Sat, Dec 8, 2012 at 2:56 AM, Michael Everson ever...@evertype.comwrote:

 On 8 Dec 2012, at 10:07, Shriramana Sharma samj...@gmail.com wrote:

  Well nice to hear, and of course you have contributed a lot to Unicode!
 
  But I fail to see the logical connection between Unicode as a technical
 standard and Tolkien! I hadn't heard about this website, but if they
 purport to write on science, but make such illogical deductions, I am not
 sure I'll be reading it much in future.

 Their inference, it appears, is that had I not read Tolkien when I was 13
 I would not be who I am today and the content of the Universal Character
 Set might be a lot different than it is.

 Michael Everson * http://www.evertype.com/






Re: StandardizedVariants.txt error?

2012-11-26 Thread Mark Davis
I agree with that analysis.


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



On Mon, Nov 26, 2012 at 1:53 PM, Whistler, Ken ken.whist...@sap.com wrote:

 Actually, I think the omission here is the word canonical. In other
 words, Section 16.4 should probably read:

 The base character in a variation sequence is never a combining character
 or a *canonical* decomposable character.

 Note that with this addition, StandardizedVariants.txt poses no
 contradiction, because all of the decomposable character instances noted
 are compatibility decomposable characters.

 The main concern here with this restriction is to ensure that one doesn't
 end up with conundrums involving canonical decompositions into sequences
 followed by a variation selector.

 In the case of compatibility decompositions, there is already no
 expectation that either the appearance or the interpretation of the text
 will be preserved. With a decomposition mapping like <font> 0069, the
 decomposition is already indicating a typically different appearance. If
 you decompose U+2139 to U+0069, you have already lost information about
 appearance and interpretation. So it isn't that much of a stretch to assume
 that any relevant variation sequences will also lose their interpretation.

 But I think it might make sense, in addition to the above textual fix, to
 add a note to the standard to indicate that variation sequences preserve
 their validity across *canonical* normalization forms, but that there is no
 guarantee that variation sequences will remain valid for any compatibility
 normalization.

 --Ken

  2012-11-24 8:12, Masatoshi Kimura wrote:
 
   According to TUS v6.2 clause 16.4,
   http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf#page=15
   The base character in a variation sequence is never a
   combining character or a decomposable character.
   However, the following base characters appearing in
   http://unicode.org/Public/6.2.0/ucd/StandardizedVariants.txt
   have a decomposition mapping.
 
  There seems to be a contradiction here. “Decomposable character” is
  defined in clause 3.7 as follows:
 
  “A character that is equivalent to a sequence of one or more other
  characters, according to the decomposition mappings found in the Unicode
  Character Database, and those described in Section 3.12, Conjoining Jamo
  Behavior.”
 
  I suppose the intended meaning in clause 16.4, given its context, is to
  say that the base character is neither a combining character nor a
  character with a decomposition that contains a combining character.
 
  Yucca
 
 






Re: Caret

2012-11-12 Thread Mark Davis
 This case remains very infrequent: it is extremely rare to start typing
text in

With arrow keys or mouse clicking it is more frequent to end up on a
directional boundary.


Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



On Mon, Nov 12, 2012 at 1:47 PM, Asmus Freytag asm...@ix.netcom.com wrote:

  On 11/12/2012 1:27 PM, Khaled Hosny wrote:

 I’m not sure where you are getting your statistics from, but I have to
 deal with all those “rare” and “extremely rare” situations all day.

  Khaled, don't mind Philippe - his experience is a bit on the
 theoretical end.

 A./



Re: Character set cluelessness

2012-10-02 Thread Mark Davis
I tend to agree. What would be useful is to have one column for the city in
the local language (or more columns for multilingual cities), but it is
extremely useful to have an ASCII version as well.

Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



On Tue, Oct 2, 2012 at 1:23 PM, Jonathan Rosenne jonathan.rose...@gmail.com
 wrote:

 I don't agree with the criticism. These place names are there to be
 readable by a wide audience, rather than writable by locals and
 specialists. They require the lowest common denominator.


 Jony


 *From:* unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] *On
 Behalf Of *john knightley
 *Sent:* Tuesday, October 02, 2012 6:35 PM
 *To:* Doug Ewell
 *Cc:* unicode@unicode.org; loc...@unece.org
 *Subject:* Re: Character set cluelessness


 Sad to say this seems to be close to the norm for all too many large
 organizations where if it isn't in the 1990's version of the Times Roman
 font then it's out. 

 John

 On 3 Oct 2012 00:26, Doug Ewell d...@ewellic.org wrote:

 The United Nations Economic Commission for Europe (UNECE) has released a
 new version of UN/LOCODE, and their Secretariat Note document is just as
 clueless as ever about character set usage in international standards:

 Place names in UN/LOCODE are given in their national language versions
 as expressed in the Roman alphabet using the 26 characters of the
 character set adopted for international trade data interchange, with
 diacritic signs, when practicable (cf. Paragraph 3.2.2 [sic; should be
 3.3.2] of the UN/LOCODE Manual). International ISO Standard character
 sets are laid down in ISO 8859-1 (1987) and ISO10646-1 (1993). (The
 standard United States character set (437), which conforms to these ISO
 standards, is also widely used in trade data interchange).

 It's 2012. How does one get through to folks like this? I tried writing
 to them a few years ago, but I don't think they were impressed by an
 individual contribution.

 http://www.unece.org/cefact/locode/welcome.html

 --
 Doug Ewell | Thornton, Colorado, USA
 http://www.ewellic.org | @DougEwell


 



Re: Character set cluelessness

2012-10-02 Thread Mark Davis
Eg, in http://www.unece.org/fileadmin/DAM/cefact/locode/gr.htm

Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



On Tue, Oct 2, 2012 at 1:49 PM, Mark Davis ☕ m...@macchiato.com wrote:

 I tend to agree. What would be useful is to have one column for the city
 in the local language (or more columns for multilingual cities), but it is
 extremely useful to have an ASCII version as well.

 Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



 On Tue, Oct 2, 2012 at 1:23 PM, Jonathan Rosenne 
 jonathan.rose...@gmail.com wrote:

 I don't agree with the criticism. These place names are there to be
 readable by a wide audience, rather than writable by locals and
 specialists. They require the lowest common denominator.


 Jony


 *From:* unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] *On
 Behalf Of *john knightley
 *Sent:* Tuesday, October 02, 2012 6:35 PM
 *To:* Doug Ewell
 *Cc:* unicode@unicode.org; loc...@unece.org
 *Subject:* Re: Character set cluelessness


 Sad to say this seems to be close to the norm for all too many large
 organizations where if it isn't in the 1990's version of the Times Roman
 font then it's out. 

 John

 On 3 Oct 2012 00:26, Doug Ewell d...@ewellic.org wrote:

 The United Nations Economic Commission for Europe (UNECE) has released a
 new version of UN/LOCODE, and their Secretariat Note document is just as
 clueless as ever about character set usage in international standards:

 Place names in UN/LOCODE are given in their national language versions
 as expressed in the Roman alphabet using the 26 characters of the
 character set adopted for international trade data interchange, with
 diacritic signs, when practicable (cf. Paragraph 3.2.2 [sic; should be
 3.3.2] of the UN/LOCODE Manual). International ISO Standard character
 sets are laid down in ISO 8859-1 (1987) and ISO10646-1 (1993). (The
 standard United States character set (437), which conforms to these ISO
 standards, is also widely used in trade data interchange).

 It's 2012. How does one get through to folks like this? I tried writing
 to them a few years ago, but I don't think they were impressed by an
 individual contribution.

 http://www.unece.org/cefact/locode/welcome.html

 --
 Doug Ewell | Thornton, Colorado, USA
 http://www.ewellic.org | @DougEwell


 





Re: Character set cluelessness

2012-10-02 Thread Mark Davis
And just to be clear, I do agree that their documentation of the standards
usage, well, needs improvement. I'm just talking about the actual data, and
for that as a practical matter it is valuable to have both the native
language version(s) of a name, and a Latin equivalent.

Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



On Tue, Oct 2, 2012 at 2:52 PM, Mark Davis ☕ m...@macchiato.com wrote:

 Eg, in http://www.unece.org/fileadmin/DAM/cefact/locode/gr.htm

 Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



 On Tue, Oct 2, 2012 at 1:49 PM, Mark Davis ☕ m...@macchiato.com wrote:

 I tend to agree. What would be useful is to have one column for the city
 in the local language (or more columns for multilingual cities), but it is
 extremely useful to have an ASCII version as well.

 Mark https://plus.google.com/114199149796022210033
*— Il meglio è l’inimico del bene —*



 On Tue, Oct 2, 2012 at 1:23 PM, Jonathan Rosenne 
 jonathan.rose...@gmail.com wrote:

 I don't agree with the criticism. These place names are there to be
 readable by a wide audience, rather than writable by locals and
 specialists. They require the lowest common denominator.


 Jony


 *From:* unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] *On
 Behalf Of *john knightley
 *Sent:* Tuesday, October 02, 2012 6:35 PM
 *To:* Doug Ewell
 *Cc:* unicode@unicode.org; loc...@unece.org
 *Subject:* Re: Character set cluelessness


 Sad to say this seems to be close to the norm for all too many large
 organizations where if it isn't in the 1990's version of the Times Roman
 font then it's out. 

 John

 On 3 Oct 2012 00:26, Doug Ewell d...@ewellic.org wrote:

 The United Nations Economic Commission for Europe (UNECE) has released a
 new version of UN/LOCODE, and their Secretariat Note document is just as
 clueless as ever about character set usage in international standards:

 Place names in UN/LOCODE are given in their national language versions
 as expressed in the Roman alphabet using the 26 characters of the
 character set adopted for international trade data interchange, with
 diacritic signs, when practicable (cf. Paragraph 3.2.2 [sic; should be
 3.3.2] of the UN/LOCODE Manual). International ISO Standard character
 sets are laid down in ISO 8859-1 (1987) and ISO10646-1 (1993). (The
 standard United States character set (437), which conforms to these ISO
 standards, is also widely used in trade data interchange).

 It's 2012. How does one get through to folks like this? I tried writing
 to them a few years ago, but I don't think they were impressed by an
 individual contribution.

 http://www.unece.org/cefact/locode/welcome.html

 --
 Doug Ewell | Thornton, Colorado, USA
 http://www.ewellic.org | @DougEwell


 






Re: Announcing The Unicode Standard, Version 6.2

2012-09-26 Thread Mark Davis
BTW, if you want to share the announcement:

   - Google+:
   https://plus.sandbox.google.com/u/0/109412260435993059737/posts (I also
   reposted with my personal account
   https://plus.google.com/114199149796022210033.)
   - Facebook:
   http://www.facebook.com/pages/Friends-of-Unicode/127785250588285
   - Twitter: http://twitter.com/unicode/

Mark


On Wed, Sep 26, 2012 at 1:06 PM, announceme...@unicode.org wrote:


 Version 6.2 of the Unicode Standard is now available. This version adds
 only a single character, the newly adopted Turkish Lira sign; however, the
 properties and behaviors for many other characters have been adjusted.
 Emoji and pictographic symbols now have significantly improved
 line-breaking, word-breaking and grapheme cluster behaviors. The script
 categorizations for some characters are improved and better documented.

 The Unicode Collation Algorithm has been greatly enhanced for Version 6.2,
 with a major overhaul of its documentation. There have also been
 significant changes to the collation weight tables, including improved
 handling of tertiary weights for characters with decompositions, and
 changed weights for some pictographic symbols.

 The newly encoded Turkish Lira sign, like other currency symbols, is
 expected to be heavily used in its target environment. The Unicode
 Consortium accelerated the release of Unicode 6.2, to accommodate the
 urgent need for this character.

 For more details of this release, see
 http://www.unicode.org/versions/Unicode6.2.0/.

[attachment: TurkishLira75pct.jpg]

<    1   2   3   4   5   6   7   8   9   10   >