Re: Names for control characters (Was: "(in 6429)" in allkeys.txt)

2014-03-12 Thread Mark Davis
They do have aliases in NameAliases.txt: 0000;NULL;control 0000;NUL;abbreviation 0001;START OF HEADING;control 0001;SOH;abbreviation 0002;START OF TEXT;control 0002;STX;abbreviation ... Mark *— Il meglio è l’inimico del bene —* On Wed, Mar 12, 2014 at 1:3

Re: NFD -> NFC

2014-03-11 Thread Mark Davis
Not sure about your exact case, but ICU's normalization does handle those characters. http://unicode.org/cldr/utility/transform.jsp?a=nfc%3Bhex&b=%5Cu30B9%5Cu3099 (That tool uses ICU for NFC). Mark *— Il meglio è l’inimico del bene —* On Tue, Mar 11, 2014 at 4
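
The same check can be sketched locally. This is a minimal example that assumes Python's standard unicodedata module instead of ICU; the two code points are the ones in the URL above.

```python
import unicodedata

# U+30B9 KATAKANA LETTER SU followed by
# U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
decomposed = "\u30B9\u3099"
composed = unicodedata.normalize("NFC", decomposed)

# Show the result as hex code points, like the "nfc; hex" transform in the URL.
print(" ".join(f"U+{ord(ch):04X}" for ch in composed))
```

If the normalizer handles these characters, the pair composes to the single code point U+30BA KATAKANA LETTER ZU.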

Re: Unicode organization is still anti-Serbian and anti-Macedonian

2014-02-14 Thread Mark Davis
Unicode is not anti-Serbian or Macedonian. The exact level of Unicode support will depend on your operating system and font choice. For example, on the Mac there are reasonable results with arbitrary accents. Here are examples with and q̈ Q̈ Here is an image, in case your emailer or OS doesn'

Re: CJK IDS database

2014-01-14 Thread Mark Davis
Boy, I'd forgotten about those. There is an open-source collection of IDSs that I used to create those files. Unfortunately, I found that *that* data would take a lot of cleanup. I do agree that it would be very useful to have an open-source repository of IDSs for Unicode characters, but I don't k

Language Death

2013-12-05 Thread Mark Davis
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0077056 with a popular article at http://www.washingtonpost.com/blogs/worldviews/wp/2013/12/04/how-the-internet-is-killing-the-worlds-languages/ The source article was interesting, although I'd take issue with some of their methodology.

Re: Best practice of using regex on identify none-ASCII email address

2013-11-01 Thread Mark Davis
Nov 1, 2013 at 1:36 PM, Philippe Verdy wrote: > > > 2013/11/1 Mark Davis ☕ > >> These are two well-known serious flaws in EAI and URLs; there is no >> useful syntactic limit on what is in the query part of a URL or on the >> local part of an email address that would

Re: Best practice of using regex on identify none-ASCII email address

2013-11-01 Thread Mark Davis
These are two well-known serious flaws in EAI and URLs; there is no useful syntactic limit on what is in the query part of a URL or on the local part of an email address that would allow their boundaries to be detected in plaintext. No use complaining about them, because people are concerned with

Re: full-width Latin missing from confusables data

2013-10-29 Thread Mark Davis
3> * * *— Il meglio è l’inimico del bene —* ** On Tue, Oct 15, 2013 at 8:53 PM, Mark Davis ☕ wrote: > > but as Michel mentioned the data > does not seem consistent in that case. > ​ > > You might add that to your report​... > > > > Mark <https://plus.google.com

Re: Terminology question re ASCII

2013-10-28 Thread Mark Davis
Normally the term ASCII just refers to the 7-bit form. What is sometimes called "8-bit ASCII" is the same as ISO Latin 1. If you want to be completely clear, you can say "7-bit ASCII". Mark * * *— Il meglio è l’inimico del bene —* ** On Tue, Oct 2
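
A small Python sketch of the distinction, using the standard "ascii" and "latin-1" codecs (the sample word is only an illustration):

```python
text = "caf\u00e9"  # "café"; U+00E9 is outside 7-bit ASCII

# ISO Latin 1 maps U+00E9 to the single byte 0xE9.
latin1_bytes = text.encode("latin-1")

# 7-bit ASCII has no code for U+00E9, so a strict encode fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Every 7-bit byte value decodes identically under both encodings.
seven_bit = bytes(range(128))
same_for_7bit = seven_bit.decode("ascii") == seven_bit.decode("latin-1")
```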

Re: full-width Latin missing from confusables data

2013-10-15 Thread Mark Davis
013 12:40 AM, Mark Davis ☕ wrote: > > For the confusables, the presumption is that implementations have > > already either normalized the input to NFKC or have rejected input that > > is not NFKC. > > Thanks for the explanation Mark. It makes sense for implementations > whi

Re: full-width Latin missing from confusables data

2013-10-14 Thread Mark Davis
For the confusables, the presumption is that implementations have already either normalized the input to NFKC or have rejected input that is not NFKC. More broadly, in gathering data the main emphasis is on characters that fit the profile in http://www.unicode.org/reports/tr39/#Identifier_Characte
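
As a rough illustration of that presumption, here is a Python sketch using the standard unicodedata module. The helper name is_nfkc and the sample string are assumptions made for this example; they are not part of UTS #39.

```python
import unicodedata

def is_nfkc(s: str) -> bool:
    """Return True if s is already in NFKC (hypothetical helper)."""
    return unicodedata.normalize("NFKC", s) == s

# Build the full-width form of "paypal" (U+FF41 is FULLWIDTH LATIN SMALL LETTER A).
fullwidth = "".join(chr(ord(c) - ord("a") + 0xFF41) for c in "paypal")

# An implementation either rejects the input because it is not NFKC ...
rejected = not is_nfkc(fullwidth)

# ... or normalizes it, after which the full-width letters are plain ASCII
# and no longer need their own entries in the confusables data.
folded = unicodedata.normalize("NFKC", fullwidth)
```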

Re: More additional Greek (and Hebrew) characters needed for proposal

2013-09-21 Thread Mark Davis
http://www.unicode.org/faq/char_combmark.html#9 and following. Mark * * *— Il meglio è l’inimico del bene —* ** On Sat, Sep 21, 2013 at 7:38 PM, Robert Wheelock wrote: > Hello again, y’all! > > I’ve got quite a few characters (currently missing)

Re: Code point vs. scalar value

2013-09-20 Thread Mark Davis
Nicely stated. Mark * * *— Il meglio è l’inimico del bene —* ** On Thu, Sep 19, 2013 at 11:21 PM, Whistler, Ken wrote: > Stephan Stiller seems unconvinced by the various attempts to explain the > situation. Perhaps an authoritative explanation o

Re: Draft of LDML Specification for CLDR release 24

2013-09-13 Thread Mark Davis
Thanks for the feedback; the typo is fixed. Mark * * *— Il meglio è l’inimico del bene —* ** On Fri, Sep 13, 2013 at 1:19 AM, Philippe Verdy wrote: > Typo in section 2.3 "Number Symbols", for the new item > "superscriptingExponent" which describ

Re: polytonic Greek: diacritics above long vowels ᾱ, ῑ, ῡ

2013-08-05 Thread Mark Davis
> Classical Greek might qualify [for a CLDR entry] It certainly qualifies, but we require that a submitter commit to collecting a minimal amount of data before we add it. See http://cldr.unicode.org/index/cldr-spec/minimaldata Mark * * *— Il meglio

Re: Behdad Esfahbod won an O'Reilly Open Source Award!

2013-07-29 Thread Mark Davis
Great news, and well deserved! Congratulations, Behdad! Mark * * *— Il meglio è l’inimico del bene —* ** On Mon, Jul 29, 2013 at 9:41 PM, Roozbeh Pournader wrote: > Some of you probably have heard the news already, but in case you haven't, > Beh

Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

2013-07-19 Thread Mark Davis
Popping up a level. ICU (and some other libraries) have heuristic encoding detection, that will take a sequence of bytes and come up with a likely encoding id. Mark * * *— Il meglio è l’inimico del bene —* ** On Fri, Jul 19, 2013 at 8:40 PM, Whis

Re: The skywriter we hired has terrible Unicode support

2013-05-08 Thread Mark Davis
Saw that, thanks! Mark * * *— Il meglio è l’inimico del bene —* ** On Wed, May 8, 2013 at 8:26 PM, Tim Greenwood wrote: > http://xkcd.com/1209/ >

RE: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-20 Thread Mark Davis
LOL... {phone} On Apr 20, 2013 8:44 PM, "Erkki I Kolehmainen" wrote: > Mr. Overington, > > I'm sorry to have to admit that I cannot follow at all your train of > thought on what would be the practical value of localizable sentences in > any of the forms that you are contemplating. In my mind, th

Re: Rendering Raised FULL STOP between Digits

2013-03-10 Thread Mark Davis
Should the Unicode Consortium decide to recommend an existing (or new) character as a raised decimal for numbers, we would add that to CLDR, and recommend that implementations accept either one as equivalent when parsing. Mark * * *— Il meglio è l’i

Re: JSON version of CLDR

2013-03-03 Thread Mark Davis
I think just the main data is converted. If you want to request the other data you can file a cldr ticket. Mark * * *— Il meglio è l’inimico del bene —* ** On Sat, Mar 2, 2013 at 8:35 PM, Edwin Hoogerbeets wrote: > Hi all, I am trying to find the

Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-11 Thread Mark Davis
The draft update to LDML for collation is at http://unicode.org/repos/cldr/trunk/specs/ldml/tr35-collation.html. Bugs or requests can be filed at http://unicode.org/cldr/trac/newticket . Mark * * *— Il meglio è l’inimico del bene —* ** On Mon, Feb

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Mark Davis
In practice and by design, treating isolated surrogates the same as reserved code points in processing, and then cleaning up on conversion to UTFs works just fine. It is a tradeoff that is up to the implementation. It has nothing to do with a "legacy of C pointer arithmetic". It does represent a p
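
A sketch of that tradeoff in Python. Note the assumption: Python's str is a sequence of code points, not 16-bit code units, so this is only an analogy to the 16-bit strings under discussion.

```python
s = "a\ud800b"  # an isolated high surrogate between two letters

# Processing treats the surrogate like any other code point.
code_points = [ord(ch) for ch in s]

# Strict conversion to a UTF rejects it ...
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ... so clean up at the conversion boundary, here by substituting U+FFFD.
cleaned = "".join("\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in s)
utf8 = cleaned.encode("utf-8")
```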

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Mark Davis
That's not the point (see successive messages). Mark * * *— Il meglio è l’inimico del bene —* ** On Mon, Jan 7, 2013 at 4:59 PM, "Martin J. Dürst" wrote: > On 2013/01/08 3:27, Markus Scherer wrote: > > Also, we commonly read code points from 16-

Re: Are there Unicode processors?

2013-01-07 Thread Mark Davis
That is not the typical way that Unicode text is processed. Typically whatever OS you are using will supply mechanisms for iterating through any Unicode string, returning each of the code points. It may also offer APIs for returning information about each character (called 'property values', or yo
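
For instance, a minimal sketch in Python, where the language runtime plays the role of the OS-supplied mechanism (the function name describe is made up for this example):

```python
import unicodedata

def describe(text):
    """List (code point, General_Category, name) for each code point in text."""
    rows = []
    for ch in text:  # iterating a str yields one code point at a time
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),           # a property value
            unicodedata.name(ch, "<unnamed>"),  # another property value
        ))
    return rows

for row in describe("a\U0001F600"):
    print(row)
```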

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Mark Davis
Because all well-formed sequences (and subsequences) are interpreted according to the corresponding UTF. That is quite different from random byte stream with no declared semantics, or a byte stream with a different declared semantic. Thus if you are given a Unicode 8-bit string <61, 62, 80, 63>, y
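
The snippet is cut off, so what follows only shows how one strict UTF-8 decoder (Python's built-in codec) treats the 8-bit string from the example; it does not reconstruct the rest of the message.

```python
data = bytes([0x61, 0x62, 0x80, 0x63])

# 0x80 is a continuation byte with no lead byte, so the sequence as a whole
# is not well-formed UTF-8 ...
try:
    data.decode("utf-8")
    well_formed = True
except UnicodeDecodeError:
    well_formed = False

# ... but the well-formed subsequences around it are still interpreted as UTF-8.
repaired = data.decode("utf-8", errors="replace")
```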

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-07 Thread Mark Davis
> But still non-conformant. That's incorrect. The point I was making above is that in order to say that something is "non-conformant", you have to be very clear what it is "non-conformant" *TO* . > Also, we commonly read code points from 16-bit Unicode strings, and > unpaired surrogates are retu

Re: If X sorts before Y, then XZ sorts before YZ ... example of where that's not true?

2013-01-06 Thread Mark Davis
There are many cases of such digraphs. Example from Slovak: c < d < h but cd < h < ch Cf http://www.unicode.org/reports/tr10/, searching for Slovak. Mark * * *— Il meglio è l’inimico del bene —* ** On Sun, Jan 6, 2013 at 1:56 PM, Costello, Roge
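
The Slovak example can be sketched with a toy sort key. This is not the UCA or a CLDR tailoring: the alphabet below is an assumption that only adds the digraph "ch" after "h" and ignores the accented Slovak letters and the other digraphs.

```python
import re

# Simplified alphabet: "ch" is a single letter that sorts after "h".
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
TOKEN = re.compile("ch|.")  # prefer the digraph over single letters

def sort_key(word):
    """Map a lowercase a-z word to a list of letter ranks (hypothetical helper)."""
    return [RANK[t] for t in TOKEN.findall(word.lower())]

ordered = sorted(["ch", "h", "cd", "d", "c"], key=sort_key)
print(ordered)
```

With X = "c", Y = "d" and Z = "h", X sorts before Y but XZ ("ch") sorts after YZ ("dh"), which is the counterexample the thread asks for.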

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-06 Thread Mark Davis
Some of this is simply historical: had Unicode been designed from the start with 8 and 16 bit forms in mind, some of this could be avoided. But that is water long under the bridge. Here is a simple example of why we have both UTFs and Unicode Strings. Java uses Unicode 16-bit Strings. The followin

Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Mark Davis
To assess whether a string is invalid, it all depends on what the string is supposed to be. 1. As Ken says, if a string is supposed to be in a given encoding form (UTF), but it consists of an ill-formed sequence of code units for that encoding form, it would be invalid. So an isolated surrogate (e

Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Mark Davis
http://www.unicode.org/alloc/CurrentAllocaiton.html => http://www.unicode.org/alloc/CurrentAllocation.html Mark * * *— Il meglio è l’inimico del bene —* ** On Fri, Jan 4, 2013 at 10:24 AM, Whistler, Ken wrote: > Stephan Stiller continued: > > >

Re: locale-aware string comparisons

2013-01-02 Thread Mark Davis
e case > insensitive. > > -Shawn > > -Original Message- > From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On > Behalf Of James Cloos > Sent: Tuesday, January 1, 2013 5:43 PM > To: Mark Davis ☕ > Cc: Whistler, Ken; unicode@unicode.org > Subject: R

Re: locale-aware string comparisons

2013-01-01 Thread Mark Davis
> 3. Regarding LDML and CLDR, somebody with specific expertise on CLDR James, Even without locale differences, the situation is a bit tricky. Assuming that str_tolower() and str_toupper() were straightforwardly defined in terms of the (full) Unicode case mappings, there is still the issue that the
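
The snippet is cut off before it names the issue, so this sketch only illustrates one well-known wrinkle of full case mappings, not necessarily the one meant here. It uses Python's str methods as stand-ins for str_tolower() and str_toupper().

```python
s = "Stra\u00dfe"  # "Straße"

upper = s.upper()           # full case mapping expands U+00DF to "SS"
round_trip = upper.lower()  # lowercasing does not bring U+00DF back

# Comparing lowercased strings therefore misses a match
# that case folding finds.
lower_equal = s.lower() == "STRASSE".lower()
folded_equal = s.casefold() == "STRASSE".casefold()
```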

Re: Character name translations

2012-12-20 Thread Mark Davis
There are different use cases, and I think they are getting confused. 1. Present a name for each character, some sort of formal name. I think this is probably the least useful for average users. 2. Allow searching for characters, eg in a character picker. Sample use case: search for "dash" (or th

Some much-needed improvements in JavaScript i18n

2012-12-19 Thread Mark Davis
I have a new Google blog post about the new ECMAScript (JavaScript) internationalization spec. “Until now, it has been very difficult for web application designers to do something as simple as sort names correctly according to the user's language. And it matters: English readers wouldn’t expect År

Re: Question about normalization tests

2012-12-10 Thread Mark Davis
0300 *is* blocked, because there is a preceding character (0305) that has the same combining class (230). Mark * * *— Il meglio è l’inimico del bene —* ** On Mon, Dec 10, 2012 at 11:55 AM, Edwin Hoogerbeets wrote: > Looking at 0300, it is also no
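
A quick way to see the blocking, sketched with Python's unicodedata. The base letter "a" is an assumption; the snippet does not show which test line was being discussed.

```python
import unicodedata

blocked = "a\u0305\u0300"  # a, COMBINING OVERLINE, COMBINING GRAVE ACCENT
ccc = [unicodedata.combining(ch) for ch in blocked]

# U+0305 does not compose with "a" and has the same combining class as U+0300,
# so U+0300 is blocked from the starter and NFC leaves the sequence unchanged.
nfc_blocked = unicodedata.normalize("NFC", blocked)

# Without the intervening mark, U+0300 composes with "a" to U+00E0.
nfc_unblocked = unicodedata.normalize("NFC", "a\u0300")
```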

Re: io9 describes Unicode as one of the 10 most unlikely things influenced by J.R.R. Tolkien

2012-12-08 Thread Mark Davis
> Their inference, it appears, is that had I not read Tolkien when I was 13 I would not be who I am today and the content of the Universal Character Set might be a lot different than it is. I doubt it. Many people are far more responsible for the structure, model, properties, and characters of Un

Re: StandardizedVariants.txt error?

2012-11-26 Thread Mark Davis
I agree with that analysis. Mark * * *— Il meglio è l’inimico del bene —* ** On Mon, Nov 26, 2012 at 1:53 PM, Whistler, Ken wrote: > Actually, I think the omission here is the word "canonical". In other > words, Section 16.4 should probably rea

Re: Caret

2012-11-12 Thread Mark Davis
> This case remains very infrequent: it is extremely rare to start typing text in With arrow keys or mouse clicking it is more frequent to end up on a directional boundary. Mark * * *— Il meglio è l’inimico del bene —* ** On Mon, Nov 12, 2012 at

Re: Character set cluelessness

2012-10-02 Thread Mark Davis
<https://plus.google.com/114199149796022210033> * * *— Il meglio è l’inimico del bene —* ** On Tue, Oct 2, 2012 at 2:52 PM, Mark Davis ☕ wrote: > Eg, in http://www.unece.org/fileadmin/DAM/cefact/locode/gr.htm > > Mark <https://plus.google.com/114199149796022210033> > * > * >

Re: Character set cluelessness

2012-10-02 Thread Mark Davis
Eg, in http://www.unece.org/fileadmin/DAM/cefact/locode/gr.htm Mark <https://plus.google.com/114199149796022210033> * * *— Il meglio è l’inimico del bene —* ** On Tue, Oct 2, 2012 at 1:49 PM, Mark Davis ☕ wrote: > I tend to agree. What would be useful is to have one column for the

Re: Character set cluelessness

2012-10-02 Thread Mark Davis
I tend to agree. What would be useful is to have one column for the city in the local language (or more columns for multilingual cities), but it is extremely useful to have an ASCII version as well. Mark * * *— Il meglio è l’inimico del bene —* **

Re: Announcing The Unicode Standard, Version 6.2

2012-09-26 Thread Mark Davis
BTW, if you want to share the announcement: - Google+: https://plus.sandbox.google.com/u/0/109412260435993059737/posts (I also reposted at with my personal account .) - Facebook: http://www.facebook.com/pages/Friends-of-Unicode/12778

Re: Compiling a list of Semitic transliteration characters

2012-09-06 Thread Mark Davis
mico del bene —* ** On Thu, Sep 6, 2012 at 4:02 PM, Jukka K. Korpela wrote: > 2012-09-07 1:54, Mark Davis ☕ wrote: > > This might come off as a bit snarky, but do you /really/ think the >> >> author and every one of the commentators on the thread all really meant >&

Re: Compiling a list of Semitic transliteration characters

2012-09-06 Thread Mark Davis
On Thu, Sep 6, 2012 at 3:25 PM, Jukka K. Korpela wrote: > 2012-09-07 0:59, Mark Davis ☕ wrote: > > They might be distinct in Finnish, but in English only in specialized >> contexts, >> > > This is not about everyday language (which is irrelevant in this context) >

Re: Compiling a list of Semitic transliteration characters

2012-09-06 Thread Mark Davis
think* they understand, but which are being used with a specialized, non-customary meaning. Mark <https://plus.google.com/114199149796022210033> * * *— Il meglio è l’inimico del bene —* ** On Thu, Sep 6, 2012 at 2:07 PM, Jukka K. Korpela wrote: > 2012-09-06 23:47, Mark Davis ☕ w

Re: Compiling a list of Semitic transliteration characters

2012-09-06 Thread Mark Davis
The distinction between "transliteration" and "transcription" is limited to a few people. It is far better to use unambiguous terms, like "lossy" vs "lossless". Romanization (a transliteration/transcription into Latin script) in general can be either. Romanization of Chinese ideographs is particul

UTF-8 turns 20

2012-09-06 Thread Mark Davis
Rob Pike (Yesterday 6:12 PM, Public): UTF-8 turned 20 years old yesterday. It's been well documented elsewhere (http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt) that one Wednesday night, after a phone call from X/Open, Ken

Re: Searching data: map countries to scripts

2012-08-22 Thread Mark Davis
My thought exactly; people seem to be going off into the weeds. Someone in transit at the Auckland Airport, in a moment of weakness, took a Pendulums, Astrology and Runes class. Does that mark New Zealand as using the Runic script? The stated goal of the original question was around mapping countri

Re: Searching data: map countries to scripts

2012-08-20 Thread Mark Davis
CLDR has both of those, at least for official and de facto official languages. {phone} On Aug 20, 2012 4:03 PM, "Manuel Strehl" wrote: > > This might not work too well, since the ISO 15924 code elements you're > > thinking of are "Hira" and "Kana". > > This awkward moment... I'm trying to figure

Re: U+25CA LOZENGE - why is it in the "Mac OS Roman" character set (and therefore widespread in current fonts)?

2012-08-13 Thread Mark Davis
I joined the Lisa group in late '83, and that was soon absorbed into the Mac group. As I recall, the MacRoman character set was already done, based on the Lisa. This predated the laserwriter, so that wasn't the origin. The long 'f' was for use as a currency symbol (particularly for Gulden). I don'

Re: CLDR and ICU

2012-07-27 Thread Mark Davis
owsers. This thread is getting tiresome. Mark <https://plus.google.com/114199149796022210033> * * *— Il meglio è l’inimico del bene —* ** On Fri, Jul 27, 2012 at 5:00 PM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > On Fri, 27 Jul 2012 14:14:05 -0700 > Mar

Re: CLDR and ICU

2012-07-27 Thread Mark Davis
The key term is 'open interchange'. "In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, su
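
For reference, the noncharacters are a fixed set of 66 code points: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. A minimal Python sketch for recognizing them (the function name is_noncharacter is chosen for this example):

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF, or any code point whose low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

count = sum(is_noncharacter(cp) for cp in range(0x110000))
```

An application that uses these internally would typically filter or replace them before open interchange; how strictly to do so is an implementation choice.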

Re: Unicode String Models -- minor proofreading nit (was: Unicode String Models)

2012-07-26 Thread Mark Davis
Thanks, good suggestion. Mark * * *— Il meglio è l’inimico del bene —* ** On Thu, Jul 26, 2012 at 12:40 PM, CE Whitehead wrote: > "Validation;" par 3, comment in parentheses > ". . . (you never want to just delete it; that has security problems).

Re: CLDR and ICU

2012-07-25 Thread Mark Davis
Mark * * *— Il meglio è l’inimico del bene —* ** On Wed, Jul 25, 2012 at 5:01 PM, Richard Wordingham < richard.wording...@ntlworld.com> wrote: > What is the formal relationship between the Common Locale Data > Repository (CLDR) and International C

Re: Unicode String Models

2012-07-20 Thread Mark Davis
Thanks, nice article. We got into some of those hairy caret positioning issues back at Apple; we even had a design that would associate a series of lines (which could be slanted and positioned) with a ligature, but ultimately 1/m gets you 99% of the value, with very little cost. (My article was jus

Unicode String Models

2012-07-20 Thread Mark Davis
I put together some notes on different ways for programming languages to handle Unicode at a low level. Comments welcome. Macchiato »

Re: Meaning of Numeric Type "digit"

2012-07-11 Thread Mark Davis
The decimal digits only include those characters that are used as part of a standard positional decimal system. (We should be more consistent about terminology, however.) -- Mark * * *— Il meglio è l’inimico del bene —* **
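
Python's unicodedata module exposes the three numeric types separately, which makes the distinction easy to inspect. A small sketch; the helper name numeric_info and the sample characters are just for illustration.

```python
import unicodedata

def numeric_info(ch):
    """Return (General_Category, decimal value, digit value, numeric value)."""
    return (
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),  # set only for positional decimal digits
        unicodedata.digit(ch, None),    # also set for digits such as superscripts
        unicodedata.numeric(ch, None),  # set for anything with a numeric value
    )

arabic_three = numeric_info("\u0663")  # ARABIC-INDIC DIGIT THREE
superscript = numeric_info("\u00b2")   # SUPERSCRIPT TWO
one_half = numeric_info("\u00bd")      # VULGAR FRACTION ONE HALF
```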

Re: Too narrowly defined: DIVISION SIGN & COLON

2012-07-10 Thread Mark Davis
default spacing rules, the first is the ratio (which is spaced > as a relational symbol) and the second is the colon (which is spaced as > punctuation mark), both in math mode, and the last one is the colon in > text mode. > > On Tue, Jul 10, 2012 at 04:22:06PM -0700, Mark Davis ☕ wrote:

Re: Too narrowly defined: DIVISION SIGN & COLON

2012-07-10 Thread Mark Davis
e.com/114199149796022210033> * * *— Il meglio è l’inimico del bene —* ** On Tue, Jul 10, 2012 at 5:05 PM, Ken Whistler wrote: > On 7/10/2012 4:22 PM, Mark Davis ☕ wrote: > > I would disagree about the preference for ratio; I think it is a > historical accident in Unicode. > >

Re: Too narrowly defined: DIVISION SIGN & COLON

2012-07-10 Thread Mark Davis
tps://plus.google.com/114199149796022210033> * * *— Il meglio è l’inimico del bene —* ** On Tue, Jul 10, 2012 at 5:07 PM, Philippe Verdy wrote: > 2012/7/11 Mark Davis ☕ : > > I would disagree about the preference for ratio; I think it is a > historical > > accident in Unicode. >

Re: Too narrowly defined: DIVISION SIGN & COLON

2012-07-10 Thread Mark Davis
I would disagree about the preference for ratio; I think it is a historical accident in Unicode. What people use and have used for ratio is simply a colon. One writes 3:5, and I doubt that there was a well-established visual difference that demanded a separate code for it, so someone would need to

Re: Unicode 6.2.0 Beta Collation Tests

2012-07-08 Thread Mark Davis
Markus brought to my attention that the clause didn't capture what the UTC decided; he's been out of town, and we're meeting tomorrow to go through all the issues. -- Mark * * *— Il meglio è l’inimico del bene —* ** On

Using the Unicode Glossary

2012-06-25 Thread Mark Davis
(http://unicode-inc.blogspot.com/2012/06/using-unicode-glossary.html) ** The Unicode glossary is useful for people doing documents, specifications, and general-purpose articles. Each of the glossary entries now has a link on it, and clicking on that link exposes it in the address bar of your brows

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-22 Thread Mark Davis
single entry that listed 2 > > stroke counts. This seems odd as there should be other stroke count > > differences between Simplified and Traditional Chinese. I suspect that > this > > is an area needing more than one correction -- it would be better to do a > > systemat

Re: [cldr-dev] Re: Questions on Chinese collation, stroke

2012-06-08 Thread Mark Davis
nese, U+8303 has 9 strokes as Matt mentioned in the >> email. >> >> The radical "++" is counted as 4 strokes. I think there are several >> radicals have the same issue, different stroke counts, between simplified >> Chinese and traditional Chinese. >> >

Re: Questions on Chinese collation, stroke

2012-06-07 Thread Mark Davis
On Thu, Jun 7, 2012 at 4:28 PM, Matt Ma wrote: > Hi, > > I have two questions regarding the collation sequence defined in > zh.xml, CLDR 21.0 > > 1. Why is U+8303 (范) counted as 9 strokes instead of 8 for type="stroke">? As a reference, U+59DA (姚) is counted as 9 strokes but > sorted before U+8

Flag emoji

2012-05-31 Thread Mark Davis
convinced during the discussion that ZWJ was a better approach. -- Mark <https://plus.google.com/114199149796022210033> * * *— Il meglio è l’inimico del bene —* ** On Thu, May 31, 2012 at 2:47 AM, Andrew West wrote: > On 31 May 2012 00:24, Mark Davis ☕ wrote:

Re: Flag tags (was: Re: Unicode 6.2 to Support the Turkish Lira Sign)

2012-05-30 Thread Mark Davis
There is definitely a problem. The origin is complicated. All that anyone really needed were 10 characters for emoji flags, encoded as compatibility characters. However, certain people (whom I'll call Completionists) who think that if you encode one member of a set (even for compatibility characters!),

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Mark Davis
There is an action item from the UTC and CLDR committees to clarify the meanings of the setting; they are supposed to allow some degree of variation. -- Mark * * *— Il meglio è l’inimico del bene —* ** On Fri, May 18, 2

Re: Mark-Driven Script Categorisation

2012-05-17 Thread Mark Davis
absolutely -- Mark * * *— Il meglio è l’inimico del bene —* ** On Thu, May 17, 2012 at 6:07 PM, Peter Constable wrote: > Whatever Emacs or other implementations use, I'd consider 00D7 a better > choice than 0078 for a

Re: Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread Mark Davis
No, it's not. Including x in Lao for some pedagogical (I'm guessing) purpose is completely out of scope. That'd be like including π in Latin because it sometimes occurs in the middle of English text. -- Mark * * *— Il meg

Re: U+2018 is not RIGHT HIGH 6

2012-04-30 Thread Mark Davis
FYI, we have gathered data in CLDR on the usage of characters in different languages, including quotation marks (and those to use for embeddings). It is at http://unicode.org/repos/cldr-tmp/trunk/beta-charts/by_type/misc.characters.html . (The page takes a while to load because of the exemplar information

Re: Encoding of Numbers Composed of Decimal Digits (General Category of Nd)

2012-04-28 Thread Mark Davis
We don't have that as a policy ( http://www.unicode.org/policies/property_value_stability_table.html). It's worth proposing via the feedback form, because that is the expectation. -- Mark * * *— Il meglio è l’inimico del b

Re: Unicode, SMS and year 2012

2012-04-27 Thread Mark Davis
* *— Il meglio è l’inimico del bene —* ** 2012/4/27 Cristian Secară > În data de Fri, 27 Apr 2012 12:26:25 -0700, Mark Davis ☕ a scris: > > > Actually, if the goal is to get as many characters in as possible, > > Punycode might be the best solution. That is the encoding used

Re: Unicode, SMS and year 2012

2012-04-27 Thread Mark Davis
Actually, if the goal is to get as many characters in as possible, Punycode might be the best solution. That is the encoding used for internationalized domains. In that form, it uses a smaller number of bytes per character, but a parameterization allows use of all byte values.
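
To get a feel for the encoding, here is a sketch using Python's built-in "punycode" codec. That codec implements Punycode with the standard IDNA parameters, so its output is limited to ASCII; the reparameterized variant suggested above, which would use all byte values, is not something the codec offers.

```python
label = "b\u00fccher"  # "bücher"

puny = label.encode("punycode")    # ASCII letters first, then an encoded delta
utf8 = label.encode("utf-8")
utf16 = label.encode("utf-16-be")  # two bytes per character, as in UCS-2

# Compare the encoded sizes in bytes.
sizes = {"punycode": len(puny), "utf-8": len(utf8), "utf-16": len(utf16)}
print(puny, sizes)
```

How the sizes compare depends heavily on the script and the mix of characters, so any saving should be measured on representative messages.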

Re: Key Curry : Attempting to make it easy to type world languages and orthographies on the web

2012-04-17 Thread Mark Davis
FYI, we have a draft proposal for keyboard data for CLDR that may be interesting for you. http://unicode.org/repos/cldr/trunk/keyboards/ -- Mark * * *— Il meglio è l’inimico del bene —* ** On Tue, Apr 17, 2012 at 20:02

Re: Fake Unicode

2012-03-24 Thread Mark Davis
You can check on the pipeline of characters at http://unicode.org/alloc/Pipeline.html, with some pointers to how to make proposals, if you are interested -- Mark * * *— Il meglio è l’inimico del bene —* ** On Sat, M

Fake Unicode

2012-03-22 Thread Mark Davis
If you haven't seen this page, it's pretty funny: https://plus.google.com/u/0/109925364564856140495/posts. My favorite is [but you have to have played Skyrim to appreciate it.]: "I USED TO BE A LATIN CAPITAL LETTER K LIKE YOU THEN I TOOK AN ARROW IN THE KNEE" (U+10182) A close second is the lates

Re: ZWNJ And Non-spacing Marks

2012-02-27 Thread Mark Davis
Extended block) by doing this? > > On Tue, Feb 28, 2012 at 4:04 AM, Mark Davis ☕ wrote: > >> The biggest issue for indic is where the (n)j occurs before a halant. >> > > Can Mark explain this? What is the problem when ञ occurs before a halant? > > -- > Shriramana Sharma > >

Re: ZWNJ And Non-spacing Marks

2012-02-27 Thread Mark Davis
found is Devanagari: SA + ZWNJ + ANUSVARA. Does > this have some special meaning, or is it the same as the A-ACUTE case? > > Regards, > Eric > > > On 2/27/12 9:23 AM, Mark Davis ☕ wrote: > > In TUS, in http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf > > D56 Combinin

Re: ZWNJ And Non-spacing Marks

2012-02-27 Thread Mark Davis
In TUS, in http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf D56 Combining character sequence: A maximal character sequence consisting of either a base character followed by a sequence of one or more characters where each is a combining character, zero width joiner, or zero width non-jo
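
A rough Python sketch of finding such sequences. The definition quoted above is truncated, so this follows the part that is visible plus one assumption: a run of combining characters, ZWJ, or ZWNJ with no base in front of it is also reported. "Combining character" is approximated as General_Category M*, and anything else is treated as a base.

```python
import unicodedata

ZWJ, ZWNJ = "\u200d", "\u200c"

def is_extender(ch):
    """Combining character (gc = Mn, Mc, Me), ZWJ, or ZWNJ."""
    return ch in (ZWJ, ZWNJ) or unicodedata.category(ch).startswith("M")

def combining_character_sequences(text):
    """Return the maximal combining character sequences found in text."""
    runs, current = [], ""

    def flush():
        # Keep a run only if it has extenders: base + extenders, or extenders alone.
        if len(current) > 1 or (current and is_extender(current[0])):
            runs.append(current)

    for ch in text:
        if is_extender(ch):
            current += ch
        else:
            flush()
            current = ch
    flush()
    return runs
```

Under this reading, the SA + ZWNJ + ANUSVARA case from the reply thread comes out as a single combining character sequence.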

Over 60% of the web

2012-02-04 Thread Mark Davis
There's a posting with an updated graph of the percentage of the web in Unicode, for those interested: http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html Mark

Re: UCA tertiary weight assignment vs. decomposition type definition in Unicode character database

2012-01-27 Thread Mark Davis
CLDR doesn't modify anything but primaries in the root ordering. Particular languages may modify any of the levels, but I don't think anything is typically done except for primary and secondary (with the exception of Japanese, which is quite complicated). Mark *— Il meglio è l’inimico del bene —*

Article in NYT on N’ko

2011-12-12 Thread Mark Davis
Congrats to Peter Constable, quoted in this article on indigenous languages’ being supported on the internet and in products. Mark *— Il meglio è l’inimico del bene —* * [https://plus.google.com/114199149796022210033] * Everyone Speaks Text Message

Re: Question on UCA collation parameters (strength = tertiary, alternate = shifted)

2011-11-29 Thread Mark Davis
Yes, if the strength is tertiary, then Blanked and Shifted give the same results. http://www.unicode.org/reports/tr10/proposed.html#Variable_Weighting Mark *— Il meglio è l’inimico del bene —* * * * [https://plus.google.com/114199149796022210033] * On Tue, Nov 29, 2011 at 19:11, Matt Ma wrote

Re: UCA: is there a xml version of allkeys.txt available

2011-11-16 Thread Mark Davis
With Unicode 6.1.0, the DUCET is being changed so that the CLDR root no longer needs a tailoring, so once that is released it should be what you want. A beta version is at http://unicode.org/Public/UCA/6.1.0/CollationAuxiliary.zip (It will be changed before release, so don't use it in production.

Google+ internationalization

2011-10-19 Thread Mark Davis
A friend suggested that some people here might be interested in the video of Luke‘s & my talk on Google+ internationalization. https://plus.google.com/u/0/114199149796022210033/posts/hBpr3Fmm1mf Mark

Re: Japanese font on Non-Japanese Android phones

2011-10-08 Thread Mark Davis
It wasn't really a confession... that makes it sound like a crime. I did write a bit quickly at that time, though. In more detail, language tags are signals for the detection; they can be taken into account to a greater or lesser degree, depending on the strength of other signals (like the content

Re: UAX #14 (UCA): Derived primary weight ranges

2011-09-12 Thread Mark Davis
I don't think there is any particular value to that restructuring, from what I can make of your email. Note also, with regard to your message about 'real' weights, that there is no requirement that implementations preserve the DUCET values, as long as the ordering is the same. In particular, CLDR

Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-09-02 Thread Mark Davis
There are really a few purposes for this list. 1. Cover the aliases for a given character that are in *very* widespread use in the industry. 2. Cover the aliases for a given character that we have recommended that people use in UTS #18, for quite some time. 3. *Most importantly, res

Re: PRI#203: UTS#10 (UCA) update : characters needed to avoid contractions or expansions

2011-08-31 Thread Mark Davis
, Aug 31, 2011 at 10:11, Philippe Verdy wrote: > 2011/8/31 Mark Davis ☕ wrote: > > You should always look at the modifications section to see what has > changed. > > We are now using a uniform naming for the proposed versions of documents, > > and we use the same anchor

Re: PRI#203: UTS#10 (UCA) update : characters needed to avoid contractions or expansions

2011-08-31 Thread Mark Davis
less robust). Mark *— Il meglio è l’inimico del bene —* On Wed, Aug 31, 2011 at 10:03, Philippe Verdy wrote: > 2011/8/31 Mark Davis ☕ : > >> Another interesting question is: how can we encode in texts the fact > >> that a character usually considered as a ligature in

Re: PRI#203: UTS#10 (UCA) update : characters needed to avoid contractions or expansions

2011-08-31 Thread Mark Davis
Thanks for bringing this up. Mark *— Il meglio è l’inimico del bene —* On Tue, Aug 30, 2011 at 19:20, Philippe Verdy wrote: > In the proposed update of UTS#10 (UCA), subject to the PRI #203 just > posted, I note the following addition in section 3.3.2 (Contractions). > > "Characters of a contr

Re: Difference between Bidi_Class 'R' and 'AL'

2011-08-24 Thread Mark Davis
The difference between them is subtle (and I've long been convinced that having the distinction was a mistake, but that's water under the bridge). It is in their effect on European numbers that occur after them, in http://www.unicode.org/reports/tr9/#W2 (and following). Mark *— Il meglio è l’inim
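
A simplified sketch of that effect in Python. The function apply_w2 is a hypothetical helper that models only rule W2 on a flat list of bidi classes; it ignores level runs, the start-of-run type, and every other rule of the algorithm.

```python
import unicodedata

def bidi_classes(text):
    """Bidi_Class of each code point, as reported by unicodedata."""
    return [unicodedata.bidirectional(ch) for ch in text]

def apply_w2(classes):
    """W2 only: an EN whose nearest preceding strong type is AL becomes AN."""
    result = list(classes)
    last_strong = None
    for i, bc in enumerate(result):
        if bc in ("L", "R", "AL"):
            last_strong = bc
        elif bc == "EN" and last_strong == "AL":
            result[i] = "AN"
    return result

after_hebrew = apply_w2(bidi_classes("\u05d0 1"))  # HEBREW LETTER ALEF, space, 1
after_arabic = apply_w2(bidi_classes("\u0627 1"))  # ARABIC LETTER ALEF, space, 1
```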

Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Mark Davis
>just as they thought in the late 1980s that 16 bits Under the original design principles of Unicode, the goal was a bit more limited; we envisioned composition for Hangul, no need for the chunk of presentation formats, a generative mechanism for infrequent CJK ideographs, and people's using the P

Re: Endangered Alphabets

2011-08-19 Thread Mark Davis
Unicode is not architected for code pages. As John said, if you want to use code pages—reusing the same bytes for different purposes—then ISO 2022 would probably be your tool of choice. Within Unicode, you are free to use PUA codes for whatever purpose you want. If you get enough people together w

Re: Fwd: Endangered Alphabets

2011-08-19 Thread Mark Davis
+1 Mark *— Il meglio è l’inimico del bene —* On Fri, Aug 19, 2011 at 08:41, John Cowan wrote: > Michael Everson scripsit: > > > I'd like to invite everyone to support this worthwhile project: > > Worthwhile it may be, but surely misinformed as well. Does Mr. Brooks > actually suppose that fif

Re: RTL PUA?

2011-08-19 Thread Mark Davis
All of the property assignments to PUA characters (except the GC) are purely informative. The property assignments that are there are simply based on the likelihood of property assignment, and can be freely overridden by implementations. It is just more likely that PUA characters are bc:L than that

Re: [bidi] PRI 185 Revision of UBA for improved display of URL/IRIs

2011-07-27 Thread Mark Davis
Just to remind people: posting to this list does *not* mean submitting to the UTC. If you want to discuss a proposal here, not a problem, but just remember that if you want any action you have to submit to the UTC. Unicode members via: http://www.unicode.org/members/docsubmit.html Others via: http
