Re: UAX #29 and WB4

2020-03-04 Thread Mark Davis ☕️ via Unicode
One thing we have considered for a while is whether to do a rewrite of the
rules to simplify the processing (and avoid the "treat as" rules), but it
would take a fair amount of design work that we haven't had time to do. If
you (or others) are interested in getting involved, please let us know.

Mark


On Wed, Mar 4, 2020 at 11:30 AM Daniel Bünzli via Unicode <
unicode@unicode.org> wrote:

> On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buen...@erratique.ch)
> wrote:
>
> > On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buen...@erratique.ch)
> wrote:
> >
> > > Re-reading the text I suspect I should not restart the rules from the
> first one when a
> > WB4
> > > rewrite occurs but only apply the subsequent rules. Is that correct ?
> >
> > However even if that's correct I don't understand how this test case
> works:
> >
> > ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO
> WIDTH JOINER (ZWJ_FE)
> > × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
> >
> > Here the first two chars get rewritten with WB4 to ExtPict; then, if only
> > subsequent rules are applied, we end up in WB999 and a break between
> > 200D and 1F6D1.
>
> That's nonsense and not the operational model of the algorithm, which IIRC
> was once clearly stated on this list by Mark Davis (sorry, I failed to dig
> out the message): take each boundary position candidate, apply the rules
> in sequence taking the first one that matches, and then start over with
> the next one.
>
> In that case applying the rules between 1F6D1 and 200D leads to WB4, which
> then implicitly adds a non-boundary condition -- this is not really
> evident from the formalism, but see the comment above WB4; for that boundary
> position it settles the non-boundary condition. Then we start again
> applying the rules between 200D and the last 1F6D1, and WB3c matches before
> WB4 kicks in.
>
> I think the behaviour of → rules should be clarified: it's not clear on
> which data you apply it w.r.t. the boundary position candidate. If I
> understand correctly, if the match spans over the boundary position
> candidate, that simply turns it into a non-boundary. Otherwise you apply the
> rule on the left of the boundary position candidate.
>
> Regarding the question of my original message it seems at a certain point
> I knew better:
>
>   https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html
>
> Sorry for the noise.
>
> Daniel
>
> P.S. I still think UAX #29 and UAX #14 could benefit from clarifying the
> operational model of the rules a bit (I also have the impression that the
> formalism used to express all that may not be the right one, but I don't
> have anything better to propose at this time). Also it would be nicer for
> implementers if they didn't have to factorize rules themselves (e.g. like
> in the new LB30 rules of UAX #14) so that the correctness of implemented
> rules is easier to assert.
>
>
>
>
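The per-candidate operational model discussed in this thread can be sketched in a toy form. This is an assumption-laden simplification: only WB3c, WB4, and WB999 are modeled, and the property lookup is hand-rolled for this one example rather than read from the UCD.

```python
# Toy sketch of the operational model: for each boundary candidate,
# apply the rules in order and take the first that matches. WB4 is
# modeled as "an ignored character attaches to the left", WB3c as
# "no break between ZWJ and ExtPict", WB999 as "otherwise, break".
ZWJ = "\u200d"
STOP_SIGN = "\U0001f6d1"  # OCTAGONAL SIGN, an ExtPict character

def wb_prop(ch):
    # Hand-rolled property lookup covering just this example.
    if ch == ZWJ:
        return "ZWJ"
    if ch == STOP_SIGN:
        return "ExtPict"
    return "Other"

def is_boundary(s, i):
    # Candidate position between s[i-1] and s[i].
    left, right = wb_prop(s[i - 1]), wb_prop(s[i])
    if left == "ZWJ" and right == "ExtPict":
        return False  # WB3c matches before WB4 is considered
    if right == "ZWJ":
        return False  # WB4: ZWJ is ignored, attaching to the left
    return True       # WB999: any / any

s = STOP_SIGN + ZWJ + STOP_SIGN
# Matches the quoted test case: 1F6D1 x 200D x 1F6D1 (no internal breaks).
assert [is_boundary(s, i) for i in range(1, len(s))] == [False, False]
```

Note how ordering does the work here: WB3c is tried before WB4 at each candidate, which is exactly why the second candidate does not break.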


Re: Combining Marks and Variation Selectors

2020-02-02 Thread Mark Davis ☕️ via Unicode
I don't think there is a technical reason for disallowing variation
selectors after any starters (ccc=000); the normalization algorithm doesn't
care about the general category of characters.

Mark


On Sun, Feb 2, 2020 at 10:09 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sun, 2 Feb 2020 07:51:56 -0800
> Ken Whistler via Unicode  wrote:
>
> > What it comes down to is avoidance of conundrums involving canonical
> > reordering for normalization. The effect of variation selectors is
> > defined in terms of an immediate adjacency. If you allowed variation
> > selectors to be defined for combining marks of ccc!=0, then
> > normalization of sequences could, in principle, move the two apart.
> > That would make implementation of the intended rendering much more
> > difficult.
>
> I can understand that for non-starters.  However, a lot of non-spacing
> combining marks are starters (i.e. ccc=0), so they would not be a
> problem.   is an unbreakable block in
> canonical equivalence-preserving changes.  Is this restriction therefore
> just a holdover from when canonical equivalence could be corrected?
>
> Richard.
>
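Ken's reordering concern can be illustrated with Python's unicodedata. A minimal sketch: U+0301 and U+0316 are ordinary combining marks standing in here for the hypothetical case of a ccc!=0 mark that carries a variation selector.

```python
import unicodedata

# Canonical reordering sorts a run of nonzero-ccc marks by combining
# class: U+0301 (ccc=230) and U+0316 (ccc=220) swap under NFD. If a
# variation selector had to stay adjacent to a ccc!=0 mark, this kind
# of reordering could, in principle, move the pair apart.
s = "a\u0301\u0316"                    # a + acute above + grave below
nfd = unicodedata.normalize("NFD", s)
assert nfd == "a\u0316\u0301"          # marks reordered by ccc
```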


Re: Call for feedback on UTS #18: Unicode Regular Expressions

2020-01-02 Thread Mark Davis ☕️ via Unicode
The line just above that is:

Name matching rules follow Matching Rules from [UAX44#UAX44-LM2].

The deletion was based on feedback that the deleted text was a recap of the
above line, but a recap that didn't have precisely the same description.
It's best to point to the exact description, and have that be in one place.

Mark


On Thu, Jan 2, 2020 at 6:40 PM Karl Williamson via Unicode <
unicode@unicode.org> wrote:

> One thing I noticed in reviewing this is the removal of text about loose
> matching of the name property.  But I didn't see an explanation for this
> removal.  Please point me to the explanation, or tell me what it is.
>
> Specifically these lines were removed:
>
> As with other property values, names should use a loose match,
> disregarding case, spaces and hyphen (the underbar character "_" cannot
> occur in Unicode character names). An implementation may also choose to
> allow namespaces, where some prefix like "LATIN LETTER" is set globally
> and used if there is no match otherwise.
>
> There are, however, three instances that require special-casing with
> loose matching, where an extra test shall be made for the presence or
> absence of a hyphen.
>
>  U+0F68 TIBETAN LETTER A and
>  U+0F60 TIBETAN LETTER -A
>  U+0FB8 TIBETAN SUBJOINED LETTER A and
>  U+0FB0 TIBETAN SUBJOINED LETTER -A
>  U+116C HANGUL JUNGSEONG OE and
>  U+1180 HANGUL JUNGSEONG O-E
>
>
>
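The deleted text Karl quotes can be sketched as a loose-match key function. This is an illustrative simplification of the UAX44-LM2 idea, not the normative rule (which, among other things, only ignores medial hyphens under certain conditions):

```python
# Names where the hyphen is significant under loose matching --
# the three special cases listed in the quoted text.
HYPHEN_SIGNIFICANT = {
    "TIBETAN LETTER -A",
    "TIBETAN SUBJOINED LETTER -A",
    "HANGUL JUNGSEONG O-E",
}

def loose_key(name: str) -> str:
    """Illustrative loose-match key: ignore case, spaces, hyphens, and
    underscores, except in the three names where the hyphen matters."""
    up = name.upper()
    if up in HYPHEN_SIGNIFICANT:
        return up.replace(" ", "")
    return up.replace(" ", "").replace("-", "").replace("_", "")

assert loose_key("zero width joiner") == loose_key("ZERO-WIDTH_JOINER")
assert loose_key("TIBETAN LETTER -A") != loose_key("TIBETAN LETTER A")
```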


Re: Proposal to add Roman transliteration schemes to ISO 15924.

2019-12-02 Thread Mark Davis ☕️ via Unicode
Filed the following, thanks Richard.
CLDR-13445: Release link for "latest" goes to zip file

On Tue, Dec 3, 2019 at 2:31 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Mon, 2 Dec 2019 09:09:02 -0800
> Markus Scherer via Unicode  wrote:
>
> > On Mon, Dec 2, 2019 at 8:42 AM Roozbeh Pournader via Unicode <
> > unicode@unicode.org> wrote:
> >
> > > You don't need an ISO 15924 script code. You need to think in terms
> > > of BCP 47. Sanskrit in Latin would be sa-Latn.
> > >
> >
> > Right!
> >
> > Now, if you want to distinguish the different transcription systems
> > for
> > > writing Sanskrit in Latin, you can apply to registry a BCP 47
> > > variant. There are also BCP 47 extension T, which may also be
> > > useful to you:
> > >
> > > https://tools.ietf.org/html/rfc6497
> > >
> >
> > And that extension is administered by Unicode, with documentation and
> > data here:
> > http://www.unicode.org/reports/tr35/tr35.html#t_Extension
>
> But that says that the definitions are at
>
> https://github.com/unicode-org/cldr/releases/tag/latest/common/bcp47/transform.xml
> ,
> but all one currently gets from that is an error message 'XML Parsing
> Error: no element found'.
>


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-13 Thread Mark Davis ☕️ via Unicode
The problem is that most regex engines are not written to handle some
"interesting" features of canonical equivalence, like discontinuity.
Suppose that X is canonically equivalent to AB.

   - A query /X/ can match the separated A and B in the target string
   "AbB". So if I have code to do [replace /X/ in "AbB" by "pq"], how should it
   behave? "pqb", "pbq", "bpq"? If the input was in NFD (for example), should
   the output be rearranged/decomposed so that it is NFD? And so on.
   - A query /A/ can match *part* of the X in the target string "aXb". So
   if I have code to do [replace /A/ in "aXb" by "pq"], what should result:
   "apqBb"?

The syntax and APIs for regex engines are not built to handle these
features. It introduces enough complications in the code, syntax, and
semantics that no major implementation has seen fit to do it. We used to
have a section in the spec about this, but were convinced that it was
better off handled at a higher level.

Mark


On Sun, Oct 13, 2019 at 8:31 PM Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
>
> On Sun, 13 Oct 2019 17:13:28 -0700
> Asmus Freytag via Unicode   wrote:
>
>
> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <U+0041, U+0304>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
>
> Am I missing anything?
>
> Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
> should not match <LATIN CAPITAL LETTER M, COMBINING CIRCUMFLEX ACCENT>.
>
> Why does it matter if it is precomposed? Why should it? (For anyone other
> than a character coding maven).
>
>  Now, I could invent a string property so
> that \p{xLu} meant (?:\p{Lu}\p{Mn}*).
>
> I don't entirely understand what you said; you may have missed the
> distinction between "[:Lu:] can then match" and "[:Lu:] will then
> match".  I think only Greek letters expand to 4 characters in NFD.
>
> When I'm respecting canonical equivalence/working with traces, I want
> [:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
> CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
> equivalent <U+0E49 THAI CHARACTER MAI THO, U+0E39 THAI CHARACTER SARA
> UU>.  The canonical closure of that
> sequence can be messy even within scripts.  Some pairs commute: others
> don't, usually for good reasons.
>
> Some models may be more natural for different scripts. Certainly, in SEA
> or Indic scripts, most combining marks are not best modeled with properties
> as "inherited". But for L/G/C etc. it would be a different matter.
>
> For general recommendations, such as UTS#18, it would be good to move the
> state of the art so that the "primitives" are in line with the way typical
> writing systems behave, so that people can write "linguistically correct"
> regexes.
>
> A./
>
>
> Regards,
>
> Richard.
>
>
>
>
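Richard's \p{xLu} idea from this exchange — an uppercase letter plus any following nonspacing marks — can be sketched over decomposed text with the standard library. This is a hedged illustration of the concept, not his implementation:

```python
import unicodedata

def match_xLu(s, i=0):
    # Sketch of \p{xLu}, i.e. (?:\p{Lu}\p{Mn}*): an uppercase letter
    # followed by any run of nonspacing marks, starting at index i.
    if i >= len(s) or unicodedata.category(s[i]) != "Lu":
        return None
    j = i + 1
    while j < len(s) and unicodedata.category(s[j]) == "Mn":
        j += 1
    return s[i:j]

# No precomposed M-with-circumflex exists, yet the sequence matches:
assert match_xLu("M\u0302 rest") == "M\u0302"
assert match_xLu("\u0100bc") == "\u0100"  # precomposed Ā is plain Lu
```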


Re: Pure Regular Expression Engines and Literal Clusters

2019-10-11 Thread Mark Davis ☕️ via Unicode
>
> You claimed the order of alternatives mattered.  That is an important
> issue for anyone rash enough to think that the standard is fit to be
> used as a specification.
>

Regex engines differ in how they handle the interpretation of the matching
of alternatives, and it is not possible for us to wave a magic wand to
change them.

What we can do is specify how the interpretation of the properties of
strings works. By specifying that they behave like alternation AND adding
the extra constraint of putting longer strings first, we minimize the
differences
across regex engines.

>
> I'm still not entirely clear what a regular expression /[\u00c1\u00e1]/
> can mean.  If the system uses NFD to simulate Unicode conformance,
> shall the expression then be converted to /[{A\u0301}{a\u0301}]/?  Or
> should it simply fail to match any NFD string?  I've been implementing
> the view that all or none of the canonical equivalents of a string
> match.  (I therefore support mildly discontiguous substrings, though I
> don't support splitting undecomposable characters.)
>

We came to the conclusion years ago that regex engines cannot reasonably be
expected to implement canonical equivalence; they are really working at a
lower level. So you see the advice we give at
http://unicode.org/reports/tr18/#Canonical_Equivalents. (Again, no magic
wand.)


> Richard.
>
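The longest-first constraint Mark describes can be sketched by compiling a set of strings into an alternation sorted by length. The string set here is an illustrative stand-in, not a real property of strings:

```python
import re

def compile_string_property(strings):
    # Longer alternatives first, so engines whose alternation is
    # first-match (like Python's re) still find the longest match.
    ordered = sorted(strings, key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))

prop = compile_string_property({"\U0001F1EB\U0001F1F7", "\U0001F1EB", "a"})
m = prop.match("\U0001F1EB\U0001F1F7!")
assert m.group(0) == "\U0001F1EB\U0001F1F7"  # longest alternative wins
```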


Unicode website glitches. (was The Most Frequent Emoji)

2019-10-11 Thread Mark Davis ☕️ via Unicode
There was a caching problem with WordPress, where you have to do a hard
reload in some browsers. See if the problem still exists, and if the hard
reload fixes it. If anyone else is having trouble with that, let us know.

BTW, if you want to comment on the format as opposed to glitches, please
change the subject line.

Mark


On Thu, Oct 10, 2019 at 11:50 PM Martin J. Dürst via Unicode <
unicode@unicode.org> wrote:

> I had a look at the page with the frequencies. Many emoji didn't
> display, but that's my browser's problem. What was worse was that the
> sidebar and the stuff at the bottom was all looking weird. I hope this
> can be fixed.
>
> Regards,   Martin.
>
>  Forwarded Message 
> Subject: The Most Frequent Emoji
> Date: Wed, 09 Oct 2019 07:56:37 -0700
> From: announceme...@unicode.org
> Reply-To: r...@unicode.org
> To: announceme...@unicode.org
>
> How does the Unicode Consortium choose which new
> emoji to add? One important factor is data about how frequently the
> current emoji are used. Patterns of usage help to inform decisions about
> future emoji. The Consortium has been working to assemble this
> information and make it available to the public.
>
> And the two most frequently used emoji in the world are...
> 😂 and ❤️
> The new Unicode Emoji Frequency
> <https://home.unicode.org/emoji/emoji-frequency> page shows a list of
> the Unicode v12.0 emoji ranked in order of how frequently they are used.
>
> “The forecasted frequency of use is a key factor in determining whether
> to encode new emoji, and for that it is important to know the frequency
> of use of existing emoji,” said Mark Davis, President of the Unicode
> Consortium. “Understanding how frequently emoji are used helps
> prioritize which categories to focus on and which emoji to add to the
> Standard.”
>
> 
> /Over 136,000 characters are available for adoption
> <http://unicode.org/consortium/adopt-a-character.html>, to help the
> Unicode Consortium’s work on digitally disadvantaged languages./
>
>
> http://blog.unicode.org/2019/10/the-most-frequent-emoji.html
>
>
>


Re: Unicode "no-op" Character?

2019-07-03 Thread Mark Davis ☕️ via Unicode
Your goal is not achievable. We can't wave a magic wand and have all
processes everywhere suddenly (or even within decades) ignore U+000F in
all processing.

This thread is pointless and should be terminated.

Mark


On Wed, Jul 3, 2019 at 5:48 PM Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> I’m frustrated at how badly you seem to be missing the point. There is
> nothing impossible nor self-contradictory here. There is only the matter
> that Unicode requires all scalar values to be preserved during interchange.
> This is in many ways a good idea, and I don’t expect it to change, but
> something else would be possible if this requirement were explicitly
> dropped for a well-defined small subset of characters (even just one
> character). A modern-day SYN.
>
>
>
> Let’s say it’s U+000F. The standard takes my proposal and makes it a
> discardable, null-displayable character. What does this mean?
>
>
>
> U+000F may appear in any text. It has no (external) semantic value. But it
> may appear. It may appear a lot.
>
>
>
> Display routines (which are already dealing with combining, ligaturing,
> non-/joiners, variations, initial/medial/finals forms) understand that
> U+000F is to be processed as a no-op. Do nothing with this. Drop it. Move
> to the next character. Simple.
>
>
>
> Security gateways filter it out completely, as a matter of best practice
> and security-in-depth.
>
>
>
> A process, let’s call it Process W, adds a bunch of U+000F to a string it
> received, or built, or a user entered via keyboard. Maybe it’s to
> packetize. Maybe to mark every word that is an anagram of the name of a
> famous 19th-century painter, or that represents a pizza topping. Maybe
> something else. This is a versatile character. Process W is done adding
> U+000F to the string. It stores it in a database UTF-8 encoded field.
> Encoding isn’t a problem. The database is happy.
>
>
>
> Now Process X runs. Process X is meant to work with Process W and it’s
> well-aware of how U+000F is used. It reads the string from the database. It
> sees U+000F and interprets it. It chops the string into packets, or does a
> websearch for each famous painter, or it orders pizza. The private meaning
> of U+000F is known to both Process X and Process W. There is useful
> information encoded in-band, within a limited private context.
>
>
>
> But now we have Process Y. Process Y doesn’t care about packets or
> painters or pizza. Process Y runs outside of the private context that X and
> W had. Process Y translates strings into Morse code for transmission. As
> part of that, it replaces common words with abbreviations. Process Y
> doesn’t interpret U+000F. Why would it? It has no semantic value to Process
> Y.
>
>
>
> Process Y reads the string from the database. Internally, it clears all
> instances of U+000F from the string. They’re just taking up space. They’re
> meaningless to Y. It compiles the Morse code sequence into an audio file.
>
>
>
> But now we have Process Z. Process Z wants to take a string and mark every
> instance of five contiguous Latin consonants. It scrapes the database
> looking for text strings. It finds the string Process W created and marked.
> Z has no obligation to W. It’s not part of that private context. Process Z
> clears all instances of U+000F it finds, then inserts its own wherever it
> finds five-consonant clusters. It stores its results in a UTF-16LE text
> file. It’s allowed to do that.
>
>
>
> Nothing impossible happened here. Let’s summarize:
>
>
>
> Processes W and X established a private meaning for U+000F by agreement
> and interacted based on that meaning.
>
>
>
> Process Y ignored U+000F completely because it assigned no meaning to it.
>
>
>
> Process Z assigned a completely new meaning to U+000F. That’s permitted
> because U+000F is special and is guaranteed to have no semantics without
> private agreement and doesn’t need to be preserved.
>
>
>
> There is no need to escape anything. Escaping is used when a character
> must have more than one meaning (i.e. it is overloaded, as when it is both
> text and markup). U+000F only gets one meaning in any context. In a new
> context, the meaning gets overridden, not overloaded. That’s what makes it
> special.
>
>
>
> I don’t expect to see any of this in official Unicode. But I take
> exception to the idea that I’m suggesting something impossible.
>
>
>
>
>
> *From:* Philippe Verdy [mailto:verd...@wanadoo.fr]
> *Sent:* Wednesday, July 03, 2019 04:49
> *To:* Sławomir Osipiuk
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Unicode "no-op" Character?
>
>
>
> Your goal is **impossible** to reach with Unicode. Assume such a character
> is "added" to the UCS; then it can appear in the text. Your goal being that
> it should be "guaranteed" not to be used in any text means that your
> "character" cannot be encoded at all.
>


Re: Unicode "no-op" Character?

2019-06-22 Thread Mark Davis ☕️ via Unicode
There is nothing like what you are describing. Examples:

   1. Display — There are a few of the Default Ignorables that are always
   treated as invisible, and have little effect on other characters. However,
   even those will generally interfere with the display of sequences (e.g.
   between 'q' and U+0308 ( q̈ ); within emoji sequences; within ligatures),
   line breaking, etc.
   2. Interpretation — There is no character that would always be ignored
   by all processes. Some processes may ignore some characters (eg a search
   indexer may ignore most default ignorables), but there is nothing that all
   processes will ignore.

The only exception would be cooperating processes that had agreed
beforehand to strip some particular character.

Mark


On Sat, Jun 22, 2019 at 6:49 AM Sławomir Osipiuk via Unicode <
unicode@unicode.org> wrote:

> Does Unicode include a character that does nothing at all? I’m talking
> about something that can be used for padding data without affecting
> interpretation of other characters, including combining chars and
> ligatures. I.e. a character that could hypothetically be inserted between a
> latin E and a combining acute and still produce É. The historical
> description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what
> I want. It only has one slight disadvantage: it doesn’t work. All software
> I’ve tried displays it as an unknown character and it definitely breaks up
> combinations. And U+0000 NULL seems even worse.
>
>
>
> I can imagine the answer is that this thing I’m looking for isn’t a
> character at all and so should be the business of “a higher-level protocol”
> and not what Unicode was made for… but Unicode does include some odd things
> so I wonder if there is something like that regardless. Can anyone offer
> any suggestions?
>
>
>
> Sławomir Osipiuk
>
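The cooperating-process exception Mark mentions is trivial to sketch. U+000F here is Sławomir's hypothetical choice of character; nothing in Unicode itself designates it (or anything else) as a discardable no-op:

```python
# Two processes that have privately agreed that U+000F is discardable
# can strip it before any other handling. This safety exists only
# inside that private agreement, not across uncooperating processes.
AGREED_NOOP = "\u000f"

def strip_agreed_noop(text: str) -> str:
    return text.replace(AGREED_NOOP, "")

assert strip_agreed_noop("pi\u000fzza") == "pizza"
assert strip_agreed_noop("plain") == "plain"
```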


Re: Unicode CLDR 35 alpha available for testing

2019-03-05 Thread Mark Davis ☕️ via Unicode
Just via svn checkout for the alpha.

By next time we plan to be on GitHub...

{phone}

On Thu, Feb 28, 2019, 13:07 Doug Ewell via Unicode 
wrote:

> announcements at unicode.org wrote:
>
> > The alpha version of Unicode CLDR 35
> >  is available for
> > testing.
>
> No downloadable data files in the sense of released builds, correct?
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
>


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark Davis ☕️ via Unicode
That is a fair point; if you could get everyone to use keyboards that
inserted such a character, and also get people with current data (e.g. the
Thesaurus Linguae Graecae) to process their text, then it would behave as
expected.

Mark


On Mon, Jan 28, 2019 at 8:55 AM James Kass via Unicode 
wrote:

>
> On 2019-01-28 7:31 AM, Mark Davis ☕️ via Unicode wrote:
> > Expecting people to type in hard-to-find invisible characters just to
> > correct double-click is not a realistic expectation.
>
> True, which is why such entries, when consistent, are properly handled
> at the keyboard driver level.  It's a presumption that Greek classicists
> are already specifying fonts and using dedicated keyboard drivers.
> Based on the description provided by James Tauber, it should be
> relatively simple to make the keyboard insert some kind of joiner before
> U+2019 if it follows a Greek letter. This would not be visible to the
> end-user.
>
> This approach would also mean that plain-text, which has no language
> tagging mechanism, would "get it right" cross-platform, cross-applications.
>
>


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark Davis ☕️ via Unicode
It would certainly be possible (and relatively simple) to change ’ into a
word character for languages that don't use ’ for any other purpose. And if
no languages using a particular script use ’ for another purpose, then it
is particularly easy. (If you depend on language tagging, then any software
that doesn't maintain the language tagging will cause it to revert to the
default behavior.)

So does modern Greek use ’ in trailing environments where people
wouldn't expect it to be included in word selection?

Mark


On Mon, Jan 28, 2019 at 8:49 AM James Tauber  wrote:

> On Mon, Jan 28, 2019 at 2:31 AM Mark Davis ☕️  wrote:
>
>> But the question is how important those are in daily life. I'm not sure
>> why the double-click selection behavior is so much more of a problem for
>> Ancient Greek users than it is for the somewhat larger community of English
>> users. Word selection is not normally as important an operation as line
>> break, which does work as expected.
>>
>
> Even if they don't _really_ care about word selection, there are digital
> classicists who care even less about U+2019 being the preferred character
> which makes it harder for me to make my case :-)
>
> What triggered the question in my original post about tailoring the Word
> Boundary Rules was the statement in TR29 "A further complication is the use
> of the same character as an apostrophe and as a quotation mark. Therefore
> leading or trailing apostrophes are best excluded from the default
> definition of a word." Because Ancient Greek does not have that ambiguity,
> there's no need for the exclusion in that case. Immediately following that
> quote is a suggestion about tailoring for French and Italian which made me
> wonder if the "right" thing to do is to tailor the WBRs for Ancient Greek.
>
> I know you've said here (and in your original response to me) that you
> don't think it's worth it, but is WBR tailoring (the only|the best|a)
> technically correct way to achieve with U+2019 (in Ancient Greek) what
> people are abusing U+02BC for?
>
> James
>


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Mark Davis ☕️ via Unicode
Note that this is no different than the reasonably common cases in English
such as «the boys’ books».
(you can try various combinations in
http://unicode.org/cldr/utility/list-unicodeset.jsp)

There are certainly cases that are suboptimal in word selection. As another
example, «re-iterate» seems like it should not break around hyphens, but on
the other hand in «an out-of-the-box experience» it seems like they should.
Expecting people to type in hard-to-find invisible characters just to
correct double-click is not a realistic expectation. Short of a dictionary
or ML lookup, there is no good way to distinguish certain tricky cases.
(And that probably needs more context, to distinguish «Ted was lyin’ to her
mother.» from «She said ‘Ted was lyin’ to her mother.».)

But the question is how important those are in daily life. I'm not sure why
the double-click selection behavior is so much more of a problem for
Ancient Greek users than it is for the somewhat larger community of English
users. Word selection is not normally as important an operation as line
break, which does work as expected.

Mark



On Sun, Jan 27, 2019 at 8:13 PM James Tauber via Unicode <
unicode@unicode.org> wrote:

> On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
>
>> Except that Unicode-compliant processes aren't required to follow the
>> scheme of UAX #29, Unicode Text Segmentation.  However, it is only required
>> to select the whole word because the U+2019 is followed by a letter.
>> UAX #29 prescribes different behaviour for "dogs'" with U+2019 (interpret
>> as two 'words') and U+02BC (interpret as one word).  The GTK-based
>> email client I'm using has that difference, but also fails with
>> "don't" unless one uses U+02BC.
>>
>> However LibreOffice treats "don't" as a single word for U+0027, U+02BC
>> and U+2019, but "dogs'" as a single word only for U+02BC.  This
>> complies with UAX #29.  I'm not surprised, as LibreOffice does use or has
>> used ICU.
>>
>
> This comes back to my original question that started this thread. Many
> people creating Ancient Greek digital resources use U+02BC seemingly
> because of incorrect word-breaking with *word-final* U+2019 (which is the
> only time it occurs in Ancient Greek and always marking elision, never as
> the end of a quotation).
>
> I am trying to write guidelines as to why they should use U+2019. I'm
> convinced it's technically the right code point to use but am wanting to
> get my facts straight about how to address the word-breaking issue
> (specifically for word-final U+2019 in Ancient Greek, to be clear). In my
> original post, I asked if a language-specific tailoring of the text
> segmentation algorithm was the solution but no one here has agreed so far.
>
> Here's a concrete example from Smyth's Grammar:
>
> γένοιτ’ ἄν
>
> Double-clicking on the first word should select the U+2019 as well.
> Interestingly on macOS Mojave it does in Pages[1] but not in Notes, the
> Terminal or here in Gmail on Chrome.
>
> To be clear: when I say "should" I mean that that is the expectation
> classicists have and the failure to meet it is why some of them insist on
> using U+02BC.
>
> I'm happy if the answer is "use U+2019 and go get your text segmentation
> implementations fixed"[2] but am looking for confirmation of that.
>
> James
>
> [1] To be honest, I was impressed Pages got it right.
> [2] In the same spirit as "if certain combining character combinations
> don't work, the solution is not to add precomposed characters, it's to
> improve the fonts" or "tonos and oxia are the same and if they look
> different, it's the fault of your font".
>
>
>
>
>


Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Mark Davis ☕️ via Unicode
> breaking selection for "d'Artagnan" or "can't" into two is overly fussy.

True, and that is not what U+2019 does; it does not break medially.

Mark


On Fri, Jan 25, 2019 at 11:07 PM Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 1/25/2019 9:39 AM, James Tauber via Unicode wrote:
>
> Thank you, although the word break does still affect things like
> double-clicking to select.
>
> And people do seem to want to use U+02BC for this reason (and I'm trying
> to articulate why that isn't what U+02BC is meant for).
>
> For normal edition operations, breaking selection for "d'Artagnan" or
> "can't" into two is overly fussy.
>
> No wonder people get frustrated.
>
> A./
>
> James
>
> On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ☕️  wrote:
>
>> U+2019 is normally the character used, except where the ’ is considered a
>> letter. When it is between letters it doesn't cause a word break, but
>> because it is also a right single quote, at the end of words there is a
>> break. Thus in a phrase like «tryin’ to go» there is a word break after the
>> n, because one can't tell.
>>
>> So something like "δ’ αρχαια" (picking a phrase at random) would have a
>> word break after the delta.
>>
>> Word break:
>> δ’ αρχαια
>>
>> However, there is no *line break* between them (which is the more
>> important operation in normal usage). Probably not worth tailoring the word
>> break.
>>
>> Line break:
>> δ’ αρχαια
>>
>> Mark
>>
>>
>> On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode <
>> unicode@unicode.org> wrote:
>>
>>> There seems some debate amongst digital classicists in whether to use
>>> U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking
>>> elision. (e.g. δ’ for δέ preceding a word starting with a vowel).
>>>
>>> It seems to me that U+2019 is the technically correct choice per the
>>> Unicode Standard but it is not without at least one problem: default word
>>> breaking rules.
>>>
>>> I'm trying to provide guidelines for digital classicists in this regard.
>>>
>>> Is it correct to say the following:
>>>
>>> 1) U+2019 is the correct character to use for the apostrophe in Ancient
>>> Greek when marking elision.
>>> 2) U+02BC is a misuse of a modifier for this purpose
>>> 3) However, use of U+2019 (unlike U+02BC) means the default Word
>>> Boundary Rules in UAX#29 will (incorrectly) exclude the apostrophe from the
>>> word token
>>> 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules
>>> in UAX#29 will (incorrectly) include the apostrophe as part of a glyph
>>> cluster with the previous letter
>>> 5) The correct solution is to tailor the Word Boundary Rules in the case
>>> of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't
>>> have the same ambiguity problems with the single quotation mark as in
>>> English as it should not be used as a quotation mark in Ancient Greek)
>>>
>>> Many thanks in advance.
>>>
>>> James
>>>
>>
>
> --
> *James Tauber*
> Greek Linguistics: https://jktauber.com/
> Music Theory: https://modelling-music.com/
> Digital Tolkien: https://digitaltolkien.com/
>
> Twitter: @jtauber
>
>
>


Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread Mark Davis ☕️ via Unicode
U+2019 is normally the character used, except where the ’ is considered a
letter. When it is between letters it doesn't cause a word break, but
because it is also a right single quote, at the end of words there is a
break. Thus in a phrase like «tryin’ to go» there is a word break after the
n, because one can't tell whether the ’ is an apostrophe or a closing quote.

So something like "δ’ αρχαια" (picking a phrase at random) would have a
word break after the delta.

Word break:
δ’ αρχαια

However, there is no *line break* between them (which is the more important
operation in normal usage). Probably not worth tailoring the word break.

Line break:
δ’ αρχαια
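The differing defaults come from the two characters' properties; a quick check with Python's standard unicodedata module (an illustration added here; the Word_Break classes noted in the comments are from the UAX #29 data files):

```python
import unicodedata

# U+2019 is punctuation (category Pf) with Word_Break=MidNumLet,
# so it only glues a word together when letters follow it.
# U+02BC is a letter (category Lm) with Word_Break=ALetter,
# so it stays inside the word even at the end.
print(unicodedata.name('\u2019'), unicodedata.category('\u2019'))
print(unicodedata.name('\u02bc'), unicodedata.category('\u02bc'))
# → RIGHT SINGLE QUOTATION MARK Pf
# → MODIFIER LETTER APOSTROPHE Lm
```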

Mark


On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode <
unicode@unicode.org> wrote:

> There seems to be some debate amongst digital classicists about whether to
> U+2019 or U+02BC to represent the apostrophe in Ancient Greek when marking
> elision. (e.g. δ’ for δέ preceding a word starting with a vowel).
>
> It seems to me that U+2019 is the technically correct choice per the
> Unicode Standard but it is not without at least one problem: default word
> breaking rules.
>
> I'm trying to provide guidelines for digital classicists in this regard.
>
> Is it correct to say the following:
>
> 1) U+2019 is the correct character to use for the apostrophe in Ancient
> Greek when marking elision.
> 2) U+02BC is a misuse of a modifier for this purpose
> 3) However, use of U+2019 (unlike U+02BC) means the default Word Boundary
> Rules in UAX#29 will (incorrectly) exclude the apostrophe from the word
> token
> 4) And use of U+02BC (unlike U+2019) means Glyph Cluster Boundary Rules in
> UAX#29 will (incorrectly) include the apostrophe as part of a glyph cluster
> with the previous letter
> 5) The correct solution is to tailor the Word Boundary Rules in the case
> of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't
> have the same ambiguity problems with the single quotation mark as in
> English as it should not be used as a quotation mark in Ancient Greek)
>
> Many thanks in advance.
>
> James
>


Re: The encoding of the Welsh flag

2018-11-21 Thread Mark Davis ☕️ via Unicode
We have gotten requests for this, but the stumbling block is the lack of an
official N. Ireland document describing what the official flag is and
should look like.


“However, whilst England (St George’s Cross) Scotland (St Andrew’s Cross)
and Wales (The Dragon) have individual regional flags, the Flags Institute
in London confirms that Northern Ireland has no official regional flag.”
https://www.newsletter.co.uk/news/new-northern-ireland-flag-should-be-created-says-lord-kilclooney-1-5753950


Should the N. Irish decide on a flag, I don't foresee any problem adding it.
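For context, regional flags such as Wales's are encoded as emoji tag sequences per UTS #51; a minimal sketch of how such a sequence is assembled (subdivision_flag is an illustrative name, not a real API):

```python
# Emoji tag sequence: U+1F3F4 WAVING BLACK FLAG, then the subdivision
# code spelled in TAG characters (U+E0000 + ASCII), then U+E007F CANCEL TAG.
BLACK_FLAG = '\U0001F3F4'
CANCEL_TAG = '\U000E007F'

def subdivision_flag(code: str) -> str:
    return BLACK_FLAG + ''.join(chr(0xE0000 + ord(c)) for c in code.lower()) + CANCEL_TAG

wales = subdivision_flag('gbwls')   # flag of Wales
print([f'{ord(c):04X}' for c in wales])
# → ['1F3F4', 'E0067', 'E0062', 'E0077', 'E006C', 'E0073', 'E007F']
```

A Northern Ireland flag would need only a new valid subdivision code plus RGI status; the mechanism itself already exists.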

Mark


On Wed, Nov 21, 2018 at 7:04 PM Ken Whistler via Unicode <
unicode@unicode.org> wrote:

> Michael,
>
> On 11/21/2018 9:38 AM, Michael Everson via Unicode wrote:
> > What really annoys me about this is that there is no flag for Northern
> Ireland. The folks at CLDR did not think to ask either the UK or the Irish
> representatives to SC2 about this.
>
> Neither CLDR-TC nor SC2 has any jurisdiction here, so this is rather non
> sequitur.
>
> If you or Andrew West or anyone else is interested in pursuing an emoji
> tag sequence for an emoji flag for Northern Ireland, then that should be
> done by submitting a proposal, with justification, to the Emoji
> Subcommittee, which *does* have jurisdiction.
>
> https://unicode.org/emoji/proposals.html
>
> See in particular, Section M of the selection criteria.
>
> --Ken
>
>
>


Re: UCA unnecessary collation weight 0000

2018-11-04 Thread Mark Davis ☕️ via Unicode
Philippe, I agree that we could have structured the UCA differently. It
does make sense, for example, to have the weights be simply decimal values
instead of integers. But nobody is going to go through the substantial work
of restructuring the UCA spec and data file unless there is a very strong
reason to do so. It takes far more time and effort than people realize to
change the algorithm/data while making sure that everything lines up
without inadvertent changes being introduced.

It is just not worth the effort. There are so, so, many things we can do in
Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher
benefit.

You can continue flogging this horse all you want, but I'm muting this
thread (and I suspect I'm not the only one).

Mark


On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:

> On Fri, Nov 2, 2018 at 22:27, Ken Whistler wrote:
>
>>
>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>>
>> I was replying not about the notational representation of the DUCET data
>> table (using [.0000.0000.0000.] unnecessarily) but about the text of
>> UTR#10 itself, which remains highly confusing, contains completely
>> unnecessary steps, and just complicates things with absolutely no benefit
>> at all by introducing confusion about these "0000".
>>
>> Sorry, Philippe, but the confusion that I am seeing introduced is what
>> you are introducing to the unicode list in the course of this discussion.
>>
>>
>> UTR#10 still does not explicitly state that its use of "0000" does not
>> mean it is a valid "weight", it's a notation only
>>
>> No, it is explicitly a valid weight. And it is explicitly and normatively
>> referred to in the specification of the algorithm. See UTS10-D8 (and
>> subsequent definitions), which explicitly depend on a definition of "A
>> collation weight whose value is zero." The entire statement of what are
>> primary, secondary, tertiary, etc. collation elements depends on that
>> definition. And see the tables in Section 3.2, which also depend on those
>> definitions.
>>
>> (but the notation is used for TWO distinct purposes: one is for
>> presenting the notation format used in the DUCET
>>
>> It is *not* just a notation format used in the DUCET -- it is part of the
>> normative definitional structure of the algorithm, which then percolates
>> down into further definitions and rules and the steps of the algorithm.
>>
>
> I insist that this is NOT NEEDED at all for the definition, it is
> absolutely NOT structural. The algorithm still guarantees the SAME result.
>
> It is ONLY used to explain the format of the DUCET and the fact that this
> format does NOT use 0000 as a valid weight, and so can use it as a notation
> (in fact only a presentational feature).
>
>
>> itself to present how collation elements are structured, the other one is
>> for marking the presence of a possible, but not always required, encoding
>> of an explicit level separator for encoding sort keys).
>>
>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
>> is not part of the *notation* for collation elements, but instead is a
>> magic value chosen for the level separator precisely because zero values
>> from the collation elements are removed during sort key construction, so
>> that zero is then guaranteed to be a lower value than any remaining weight
>> added to the sort key under construction. This part of the algorithm is not
>> rocket science, by the way!
>>
>
> Here again you make a confusion: a sort key MAY use them as separators if
> it wants to compress keys by reencoding weights per level: that's the only
> case where you may want to introduce an encoding pattern starting with 0,
> while the rest of the encoding for weights in that level must use patterns
> not starting with this 0 (the number of bits to encode this 0 does not
> matter: it is only part of the encoding used on this level, which does
> not necessarily have to use 16-bit code units per weight).
>
>>
>> Even the example tables can be made without using these "0000" (for
>> example in tables showing how to build sort keys, it can present the list
>> of weights split in separate columns, one column per level, without any
>> "0000"). The implementation does not necessarily have to create a buffer
>> containing all weight values in a row, when separate buffers for each level
>> are far superior (and even more efficient as it can save space in memory).
>>
>> The UCA doesn't *require* you to do anything particular in your own
>> implementation, other than come up with the same results for string
>> comparisons.
>>
> Yes I know, but the algorithm also does not require me to use these
> invalid 0000 pseudo-weights, which the algorithm itself will always discard
> (in a completely needless step)!
>
>
>> That is clearly stated in the conformance clause of UTS #10.
>>
>> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance
>>
>> The step "S3.2" in the UCA 

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Mark Davis ☕️ via Unicode
The table is the way it is because it is easier to process (and comprehend)
when the first field is always the primary weight, second is always the
secondary, etc.

Go ahead and transform the input DUCET files as you see fit. The "should be
removed" is your personal preference. Unless we hear strong demand
otherwise from major implementers, people have better things to do than
change their parsers to suit your preference.

Mark


On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy  wrote:

> It's not just a question of "I like it or not". But the fact is that the
> standard makes the presence of 0000 required in some steps, and the
> requirement is in fact wrong: this is in fact NEVER required to create an
> equivalent collation order. These steps are completely unnecessary and
> should be removed.
>
> On Fri, Nov 2, 2018 at 14:03, Mark Davis ☕️ wrote:
>
>> You may not like the format of the data, but you are not bound to it. If
>> you don't like the data format (e.g. you want [.0021.0002] instead of
>> [.0000.0021.0002]), you can transform it however you want as long as you
>> get the same answer, as it says here:
>>
>> http://unicode.org/reports/tr10/#Conformance
>> “The Unicode Collation Algorithm is a logical specification.
>> Implementations are free to change any part of the algorithm as long as any
>> two strings compared by the implementation are ordered the same as they
>> would be by the algorithm as specified. Implementations may also use a
>> different format for the data in the Default Unicode Collation Element
>> Table. The sort key is a logical intermediate object: if an implementation
>> produces the same results in comparison of strings, the sort keys can
>> differ in format from what is specified in this document. (See Section 9,
>> Implementation Notes.)”
>>
>>
>> That is what is done, for example, in ICU's implementation. See
>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
>> collation elements" and "sort keys" to see the transformed collation
>> elements (from the DUCET + CLDR) and the resulting sort keys.
>>
>> a =>[29,05,_05] => 29 , 05 , 05 .
>> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
>> à => 
>> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
>> À => 
>>
>> Mark
>>
>>
>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
>> unicode@unicode.org> wrote:
>>
>>> As well the step 2 of the algorithm speaks about a single "array" of
>>> collation elements. Actually it's best to create one separate array per
>>> level, and append weights for each level in the relevant array for that
>>> level.
>>> The steps S2.2 to S2.4 can do this, including for derived collation
>>> elements in section 10.1, or variable weighting in section 4.
>>>
>>> This also means that for fast string compares, the primary weights can
>>> be processed on the fly (without needing any buffering) if the primary
>>> weights are different between the two strings (including when one or both
>>> of the two strings ends, and the secondary weights or tertiary weights
>>> detected until then have not found any weight higher than the minimum
>>> weight value for each level).
>>> Otherwise:
>>> - the first secondary weight higher than the minimum secondary weight
>>> value, and all subsequent secondary weights, must be buffered in a
>>> secondary buffer.
>>> - the first tertiary weight higher than the minimum tertiary weight
>>> value, and all subsequent tertiary weights, must be buffered in a tertiary
>>> buffer.
>>> - and so on for higher levels (each buffer just needs to keep a counter,
>>> when it's first used, indicating how many weights were not buffered while
>>> processing and counting the primary weights, because all these weights were
>>> all equal to the minimum value for the relevant level)
>>> - these secondary/tertiary/etc. buffers will only be used once you reach
>>> the end of the two strings when processing the primary level and no
>>> difference was found: you'll start by comparing the initial counters in
>>> these buffers and the buffer that has the largest counter value is
>>> necessarily for the smaller compared string. If both counters are equal,
>>> then you start comparing the weights stored in each buffer, until one of
>>> the buffers ends before another (the shorter buffer is for the smaller
>>> compared string). If both weight buffers reach the end, you use the next
>&

Re: UCA unnecessary collation weight 0000

2018-11-02 Thread Mark Davis ☕️ via Unicode
You may not like the format of the data, but you are not bound to it. If
you don't like the data format (e.g. you want [.0021.0002] instead of
[.0000.0021.0002]), you can transform it however you want as long as you
get the same answer, as it says here:

http://unicode.org/reports/tr10/#Conformance
“The Unicode Collation Algorithm is a logical specification.
Implementations are free to change any part of the algorithm as long as any
two strings compared by the implementation are ordered the same as they
would be by the algorithm as specified. Implementations may also use a
different format for the data in the Default Unicode Collation Element
Table. The sort key is a logical intermediate object: if an implementation
produces the same results in comparison of strings, the sort keys can
differ in format from what is specified in this document. (See Section 9,
Implementation Notes.)”


That is what is done, for example, in ICU's implementation. See
http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
collation elements" and "sort keys" to see the transformed collation
elements (from the DUCET + CLDR) and the resulting sort keys.

a =>[29,05,_05] => 29 , 05 , 05 .
a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
à => 
A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
À => 

Mark
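The role of the zero weights and the 0000 level separator being debated here can be sketched as follows (a toy collation-element table and a simplified version of the sort-key step in UTS #10 Section 7.3; this is an illustration, not ICU's actual implementation):

```python
# Toy collation elements as (primary, secondary, tertiary) weights.
# The combining grave accent is primary-ignorable (primary weight zero).
TOY_TABLE = {
    'a': (0x29, 0x05, 0x05),
    'A': (0x29, 0x05, 0x1C),
    'b': (0x2A, 0x05, 0x05),
    '\u0300': (0x00, 0x8A, 0x05),
}

def sort_key(s):
    elements = [TOY_TABLE[c] for c in s]
    key = []
    for level in range(3):
        if level > 0:
            key.append(0x0000)  # level separator: below any real weight
        # zero weights are dropped, which is why 0000 can separate levels
        key += [ce[level] for ce in elements if ce[level] != 0]
    return tuple(key)

# Accents differ at the secondary level, case at the tertiary level:
assert sort_key('a\u0300') < sort_key('A\u0300') < sort_key('b')
```

Because zero weights never survive into the key, the 0000 separator guarantees that a string that is a prefix at some level sorts before any string with an extra non-zero weight at that level, which is the property the spec's use of 0000 is buying.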


On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
unicode@unicode.org> wrote:

> As well the step 2 of the algorithm speaks about a single "array" of
> collation elements. Actually it's best to create one separate array per
> level, and append weights for each level in the relevant array for that
> level.
> The steps S2.2 to S2.4 can do this, including for derived collation
> elements in section 10.1, or variable weighting in section 4.
>
> This also means that for fast string compares, the primary weights can be
> processed on the fly (without needing any buffering) if the primary weights
> are different between the two strings (including when one or both of the
> two strings ends, and the secondary weights or tertiary weights detected
> until then have not found any weight higher than the minimum weight value
> for each level).
> Otherwise:
> - the first secondary weight higher than the minimum secondary weight
> value, and all subsequent secondary weights, must be buffered in a
> secondary buffer.
> - the first tertiary weight higher than the minimum tertiary weight value,
> and all subsequent tertiary weights, must be buffered in a tertiary buffer.
> - and so on for higher levels (each buffer just needs to keep a counter,
> when it's first used, indicating how many weights were not buffered while
> processing and counting the primary weights, because all these weights were
> all equal to the minimum value for the relevant level)
> - these secondary/tertiary/etc. buffers will only be used once you reach
> the end of the two strings when processing the primary level and no
> difference was found: you'll start by comparing the initial counters in
> these buffers and the buffer that has the largest counter value is
> necessarily for the smaller compared string. If both counters are equal,
> then you start comparing the weights stored in each buffer, until one of
> the buffers ends before another (the shorter buffer is for the smaller
> compared string). If both weight buffers reach the end, you use the next
> pair of buffers built for the next level and process them with the same
> algorithm.
>
> Nowhere will you ever need to consider any [.0000.] weight, which is just a
> notation in the format of the DUCET intended only to be readable by humans
> but never needed in any machine implementation.
>
> Now if you want to create sort keys this is similar except that you don't
> have two strings to process and compare, all you want is to create separate
> arrays of weights for each level: each level can be encoded separately, the
> encoding must be made so that when you'll concatenate the encoded arrays,
> the first few encoded *bits* in the secondary or tertiary encodings cannot
> be larger or equal to the bits used by the encoding of the primary weights
> (this only limits how you'll encode the 1st weight in each array as its
> first encoding *bits* must be lower than the first bits used to encode any
> weight in previous levels).
>
> Nowhere you are required to encode weights exactly like their logical
> weight, this encoding is fully reversible and can use any suitable
> compression technique if needed. As long as you can safely detect when an
> encoding ends, because it encounters some bits (with lower values) used to
> start the encoding of one of the higher levels, the compression is safe.
>
> For each level, you can reserve only a single code used to "mark" the
> start of another higher level followed by some bits to indicate which level
> it is, then followed by the compressed code for the level made so that each
> weight is encoded by a code not starting by the reserved mark. That
> encoding "mark" 

Re: Unicode String Models

2018-10-03 Thread Mark Davis ☕️ via Unicode
Mark


On Wed, Oct 3, 2018 at 3:01 PM Daniel Bünzli 
wrote:

> On 3 October 2018 at 09:17:10, Mark Davis ☕️ via Unicode (
> unicode@unicode.org) wrote:
>
> > There are two main choices for a scalar-value API:
> >
> > 1. Guarantee that the storage never contains surrogates. This is the
> > simplest model.
> > 2. Substitute U+FFFD for surrogates when the API returns code
> > points. This can be done where #1 is not feasible, such as where the API
> is
> a shim on top of a (perhaps large) IO buffer of 16-bit code
> units
> > that are not guaranteed to be UTF-16. The cost is extra tests on every
> code
> > point access.
>
> I'm not sure 2. really makes sense in practice: it would mean you can't
> access scalar values
> which need surrogates to be encoded.
>

Let me clear that up; I meant that "the underlying storage never contains
something that would need to be represented as a surrogate code point." Of
course, UTF-16 does need surrogate code units. What #1 would be excluding
in the case of UTF-16 would be unpaired surrogates. That is, suppose the
underlying storage is UTF-16 code units that don't satisfy #1.

0061 D83D DC7D 0061 D83D

A code point API would return for those a sequence of 4 values, the last of
which would be a surrogate code point.

0061, 0001F47D, 0061, D83D

A scalar value API would return for those also 4 values, but since we
aren't in #1, it would need to remap.

0061, 0001F47D, 0061, FFFD
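The two behaviors can be sketched over the same buffer (illustrative Python, not a real library API):

```python
# A UTF-16 code-unit buffer that is NOT guaranteed to be well-formed:
# 'a', a surrogate pair for U+1F47D, 'a', then a lone high surrogate.
UNITS = [0x0061, 0xD83D, 0xDC7D, 0x0061, 0xD83D]

def code_points(units):
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u   # may be a lone surrogate code point
            i += 1

def scalar_values(units):
    # Option #2 above: substitute U+FFFD for surrogate code points.
    for cp in code_points(units):
        yield 0xFFFD if 0xD800 <= cp <= 0xDFFF else cp

print([hex(c) for c in code_points(UNITS)])    # ['0x61', '0x1f47d', '0x61', '0xd83d']
print([hex(c) for c in scalar_values(UNITS)])  # ['0x61', '0x1f47d', '0x61', '0xfffd']
```

The extra range test on every access in scalar_values is exactly the cost mentioned above.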

>
> Also regarding 1. you can always defines an API that has this property
> regardless of the actual storage, it's only that your indexing operations
> might be costly as they do not directly map to the underlying storage array.


> That being said I don't think direct indexing/iterating for Unicode text
> is such an interesting operation due of course to the
> normalization/segmentation issues. Basically if your API provides them I
> only see these indexes as useful ways to define substrings. APIs that
> identify/iterate boundaries (and thus substrings) are more interesting due
> to the nature of Unicode text.
>

I agree that iteration is a very common case. But quite often
implementations need to have at least opaque indexes (as discussed).

>
> > If the programming language provides for such a primitive datatype, that
> is
> > possible. That would mean at a minimum that casting/converting to that
> > datatype from other numerical datatypes would require bounds-checking and
> > throwing an exception for values outside of [0x0000..0xD7FF,
> > 0xE000..0x10FFFF].
>
> Yes. But note that in practice if you are in 1. above you usually perform
> this only at the point of decoding where you are already performing a lot
> of other checks. Once done you no longer need to check anything as long as
> the operations you perform on the values preserve the invariant. Also
> converting back to an integer if you need one is a no-op: it's the identity
> function.
>

If it is a real datatype, with strong guarantees that it *never* contains
values outside of [0x0000..0xD7FF, 0xE000..0x10FFFF], then every conversion
from number will require checking. And in my experience, without a strong
guarantee the datatype is in practice pretty useless.
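A minimal sketch of such a checked scalar-value type, in Python for concreteness (here the check is a constructor-time cost rather than a zero-cost primitive; the class name Uchar is borrowed from the OCaml module discussed in this thread, purely for illustration):

```python
class Uchar:
    """A value guaranteed to be a Unicode scalar value:
    [0x0000..0xD7FF] or [0xE000..0x10FFFF]."""
    __slots__ = ('_v',)

    def __init__(self, v: int):
        # Bounds check on every conversion from a plain integer.
        if not (0 <= v <= 0xD7FF or 0xE000 <= v <= 0x10FFFF):
            raise ValueError(f'not a Unicode scalar value: {v:#x}')
        self._v = v

    def to_int(self) -> int:
        # Converting back to an integer is the identity.
        return self._v

assert Uchar(0x1F47D).to_int() == 0x1F47D
try:
    Uchar(0xD83D)   # a surrogate: rejected at construction time
except ValueError:
    pass
```

As long as every path into the type goes through the constructor, downstream consumers (such as UTF-8/16 encoders) never need to re-check the invariant.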


>
> The OCaml Uchar module does this. This is the interface:
>
>   https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.mli
>
> which defines the type t as abstract and here is the implementation:
>
>   https://github.com/ocaml/ocaml/blob/trunk/stdlib/uchar.ml
>
> which defines the implementation of type t = int which means values of
> this type are an *unboxed* OCaml integer (and will be stored as such in say
> an OCaml array). However since the module system enforces type abstraction
> the only way of creating such values is to use the constants or the
> constructors (e.g. of_int) which all maintain the scalar value invariant
> (if you disregard the unsafe_* functions).
>
> Note that it would perfectly be possible to adopt a similar approach in C
> via a typedef though given C's rather loose type system a little bit more
> discipline would be required from the programmer (always go through the
> constructor functions to create values of the type).


That's the C motto: "requiring a 'bit more' discipline from programmers"

>


> Best,
>
> Daniel
>
>
>


Re: Unicode String Models

2018-10-03 Thread Mark Davis ☕️ via Unicode
Mark


On Tue, Oct 2, 2018 at 8:31 PM Daniel Bünzli 
wrote:

> On 2 October 2018 at 14:03:48, Mark Davis ☕️ via Unicode (
> unicode@unicode.org) wrote:
>
> > Because of performance and storage consideration, you need to consider
> the
> > possible internal data structures when you are looking at something as
> > low-level as strings. But most of the 'model's in the document are only
> > really distinguished by API, only the "Code Point model" discussions are
> > segmented by internal storage, as with "Code Point Model: UTF-32"
>
> I guess my gripe with the presentation of that document is that it
> perpetuates the problem of confusing "unicode characters" (or integers, or
> scalar values) and their *encoding* (how to represent these integers as
> byte sequences) which a source of endless confusion among programmers.
>
> This confusion is easy lifted once you explain that there exists certain
> integers, the scalar values, which are your actual characters and then you
> have different ways of encoding your characters; one can then explain that
> a surrogate is not a character per se, it's a hack and there's no point in
> indexing them except if you want trouble.
>
> This may also suggest another taxonomy of classification for the APIs,
> those in which you work directly with the character data (the scalar
> values) and those in which you work with an encoding of the actual
> character data (e.g. a JavaScript string).
>

Thanks for the feedback. It is worth adding a discussion of the issues,
perhaps something like:

A code-point-based API takes and returns int32's, although only a small
subset of the values are valid code points, namely 0x0..0x10FFFF. (In
practice some APIs may support returning -1 to signal an error or
termination, such as before or after the end of a string.) A surrogate code
point is one in U+D800..U+DFFF; these reflect a range of special code units
used in pairs in UTF-16 for representing code points above U+FFFF. A scalar
value is a code point that is not a surrogate.

A scalar-value API for immutable strings requires that no surrogate code
points are ever returned. In practice, the main advantage of that API is
that round-tripping to UTF-8/16 is guaranteed. Otherwise, a leaked
surrogate code point is relatively harmless: Unicode properties are devised
so that clients can essentially treat them as (permanently) unassigned
characters. Warning: an iterator should *never* avoid returning surrogate
code points by skipping them; that can cause security problems; see
https://www.unicode.org/reports/tr36/tr36-7.html#Substituting_for_Ill_Formed_Subsequences
and
https://www.unicode.org/reports/tr36/tr36-7.html#Deletion_of_Noncharacters.

There are two main choices for a scalar-value API:

   1. Guarantee that the storage never contains surrogates. This is the
   simplest model.
   2. Substitute U+FFFD for surrogates when the API returns code
   points. This can be done where #1 is not feasible, such as where the API is
   a shim on top of a (perhaps large) IO buffer of 16-bit code units
   that are not guaranteed to be UTF-16. The cost is extra tests on every code
   point access.


> > In reality, most APIs are not even going to be in terms of code points:
> > they will return int32's.
>
> That reality depends on your programming language. If the latter supports
> type abstraction you can define an abstract type for scalar values (whose
> implementation may simply be an integer). If you always go through the
> constructor to create these "integers" you can maintain the invariant that
> a value of this type is an integer in the ranges [0x0000;0xD7FF] and
> [0xE000;0x10FFFF]. Knowing this invariant holds is quite useful when you
> feed your "character" data to other processes like UTF-X encoders: it
> guarantees the correctness of their outputs regardless of what the
> programmer does.
>

If the programming language provides for such a primitive datatype, that is
possible. That would mean at a minimum that casting/converting to that
datatype from other numerical datatypes would require bounds-checking and
throwing an exception for values outside of [0x0000..0xD7FF,
0xE000..0x10FFFF]. Most common-use programming languages that I know of
don't support that for primitives; the API would have to use a class, which
would be so very painful for performance/storage. If you (or others) know
of languages that do have such a cheap primitive datatype, that would be
worth mentioning!


> Best,
>
> Daniel
>
>
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Whether or not it is well suited, that's probably water under the bridge at
this point. Think of it as jargon; after all, there are
lots of cases like that: a "near miss" wasn't nearly a miss, it was nearly
a hit.

Mark


On Sun, Sep 9, 2018 at 10:56 AM Janusz S. Bień  wrote:

> On Sat, Sep 08 2018 at 18:36 +0200, Mark Davis ☕️ via Unicode wrote:
> > I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> It's a good opportunity to propose a better term for "extended grapheme
> cluster", which usually are neither extended nor clusters; it's also not
> obvious that they are always graphemes.
>
> Cf.the earlier threads
>
> https://www.unicode.org/mail-arch/unicode-ml/y2017-m03/0031.html
> https://www.unicode.org/mail-arch/unicode-ml/y2016-m09/0040.html
>
> Best regards
>
> Janusz
>
> --
>  ,
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark


On Tue, Sep 11, 2018 at 12:17 PM Henri Sivonen via Unicode <
unicode@unicode.org> wrote:

> On Sat, Sep 8, 2018 at 7:36 PM Mark Davis ☕️ via Unicode
>  wrote:
> >
> > I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> * The Grapheme Cluster Model seems to have a couple of disadvantages
> that are not mentioned:
>   1) The subunit of string is also a string (a short string conforming
> to particular constraints). There's a need for *another* more atomic
> mechanism for examining the internals of the grapheme cluster string.
>

I did mention this.


>   2) The way an arbitrary string is divided into units when iterating
> over it changes when the program is executed on a newer version of the
> language runtime that is aware of newly-assigned codepoints from a
> newer version of Unicode.
>

Good point. I did mention the EGC definitions changing, but should point
out that if you have a string with unassigned characters in it, they may be
clustered on future versions. Will add.


>  * The Python 3.3 model mentions the disadvantages of memory usage
> cliffs but doesn't mention the associated perfomance cliffs. It would
> be good to also mention that when a string manipulation causes the
> storage to expand or contract, there's a performance impact that's not
> apparent from the nature of the operation if the programmer's
> intuition works on the assumption that the programmer is dealing with
> UTF-32.
>

The focus was on immutable string models, but I didn't make that clear.
Added some text.

>
>  * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
> optionally, HotSpot
> (
> https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A
> ).
> That is, text has UTF-16 semantics, but if the high half of every code
> unit in a string is zero, only the lower half is stored. This has
> properties analogous to the Python 3.3 model, except non-BMP doesn't
> expand to UTF-32 but uses UTF-16 surrogate pairs.
>

Thanks, will add.
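The storage trick of that model can be sketched as follows (illustrative; compact is a made-up helper, not any engine's actual API):

```python
# UTF-16/Latin-1 model: semantics are always UTF-16, but when the high
# byte of every code unit is zero, store one byte per unit instead of two.
def compact(units):
    if all(u <= 0xFF for u in units):
        return ('latin1', bytes(units))                            # 1 byte/unit
    return ('utf16', b''.join(u.to_bytes(2, 'little') for u in units))  # 2 bytes/unit

assert compact([ord(c) for c in 'cafe'])[0] == 'latin1'
assert compact([0x0061, 0xD83D, 0xDC7D])[0] == 'utf16'   # 'a' + a non-BMP pair
```

As with the Python 3.3 model, appending one astral character to a long Latin-1 string forces the whole buffer to re-expand, which is the performance cliff discussed earlier.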

>
>  * I think the fact that systems that chose UTF-16 or UTF-32 have
> implemented models that try to save storage by omitting leading zeros
> and gaining complexity and performance cliffs as a result is a strong
> indication that UTF-8 should be recommended for newly-designed systems
> that don't suffer from a forceful legacy need to expose UTF-16 or
> UTF-32 semantics.
>
>  * I suggest splitting the "UTF-8 model" into three substantially
> different models:
>
>  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> UTF-8-related operations are performed when ingesting byte-oriented
> data. Byte buffers and text buffers are type-wise ambiguous. Only
> iterating over byte data by code point gives the data the UTF-8
> interpretation. Unless the data is cleaned up as a side effect of such
> iteration, malformed sequences in input survive into output.
>
>  2) UTF-8 without full trust in ability to retain validity (the model
> of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> common UTF-8 model for C and C++, but I don't have evidence to back
> this up): When data is ingested with text semantics, it is converted
> to UTF-8. For data that's supposed to already be in UTF-8, this means
> replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> data is valid UTF-8 right after input. However, iteration by code
> point doesn't trust ability of other code to retain UTF-8 validity
> perfectly and has "else" branches in order not to blow up if invalid
> UTF-8 creeps into the system.
>
>  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> have a different type in the type system than byte buffers. To go from
> a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> has been tagged as valid UTF-8, the validity is trusted completely so
> that iteration by code point does not have "else" branches for
> malformed sequences. If data that the type system indicates to be
> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> language has a default "safe" side and an opt-in "unsafe" side. The
> unsafe side is for performing low-level operations in a way where the
> responsibility of upholding invariants is moved from the compiler to
> the programmer. It's impossible to violate the UTF-8 validity
> invariant using the safe part of the language.
>

Added a quote based on this; please see if it looks ok.
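For concreteness, the second and third of the three UTF-8 models quoted above can be contrasted in a short Java sketch. This is only an illustration under my own naming (the class and method names are hypothetical); Java's NIO decoder stands in for the ingestion step, with REPLACE playing the role of model 2 (malformed sequences become U+FFFD on input) and REPORT playing the validate-up-front flavor of model 3 (invalid input is refused, so downstream code may trust validity):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Models {
    // Model 2: convert on ingestion, replacing malformed sequences with
    // U+FFFD so the data is valid UTF-16 text right after input.
    static String ingestLossy(byte[] data) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return dec.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            throw new AssertionError(e); // cannot happen with REPLACE
        }
    }

    // Model 3 flavor: validate up front and refuse malformed input
    // entirely, so later iteration needs no "else" branches.
    static String ingestStrict(byte[] data) throws CharacterCodingException {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return dec.decode(ByteBuffer.wrap(data)).toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] bad = { (byte) 0xE2, (byte) 0x82, (byte) 'a' }; // truncated U+20AC
        System.out.println(ingestLossy(bad)); // malformed prefix replaced, then 'a'
        try {
            ingestStrict(bad);
            System.out.println("accepted");
        } catch (CharacterCodingException e) {
            System.out.println("rejected");
        }
    }
}
```

(Model 1, Garbage In Garbage Out, is simply the absence of both steps.)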

Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark


On Sun, Sep 9, 2018 at 3:42 PM Daniel Bünzli 
wrote:

> Hello,
>
> I find your notion of "model" and presentation a bit confusing since it
> conflates what I would call the internal representation and the API.
>
> The internal representation defines how the Unicode text is stored and
> should not really matter to the end user of the string data structure. The
> API defines how the Unicode text is accessed, expressed by what is the
> result of an indexing operation on the string. The latter is really what
> matters for the end-user and what I would call the "model".
>

Because of performance and storage considerations, you need to consider the
possible internal data structures when you are looking at something as
low-level as strings. But most of the "models" in the document are really
distinguished only by API; only the "Code Point model" discussions are
segmented by internal storage, as with "Code Point Model: UTF-32".


> I think the presentation would benefit from making a clear distinction
> between the internal representation and the API; you could then easily
> summarize them in a table which would make a nice summary of the design
> space.
>

That's an interesting suggestion, I'll mull it over.

>
> I also think you are missing one API which is the one with ECG I would
> favour: indexing returns Unicode scalar values, internally be it whatever
> you wish UTF-{8,16,32} or a custom encoding. Maybe that's what you intended
> by the "Code Point Model: Internal 8/16/32" but that's not what it says,
> the distinction between code point and scalar value is an important one and
> I think it would be good to insist on it to clarify the minds in such
> documents.
>

In reality, most APIs are not even going to be in terms of code points:
they will return int32's. So not only are they not scalar values,
99.97% are not even code points. Of course, values above 10FFFF or below 0
shouldn't ever be stored in strings, but in practice treating
non-scalar-value-code-points as "permanently unassigned" characters doesn't
really cause problems in processing.
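This is easy to see in plain Java SE: the int-valued code point API will happily hand back a lone surrogate, which is a code point but not a scalar value. A small sketch:

```java
public class CodePointsNotScalars {
    public static void main(String[] args) {
        // A well-formed string: 'a', U+1F600 (as a surrogate pair), 'b'.
        "a\uD83D\uDE00b".codePoints()
                .forEach(cp -> System.out.printf("U+%04X%n", cp));

        // Java strings are sequences of UTF-16 code units, so nothing
        // stops an unpaired surrogate from creeping in; codePoints()
        // then yields a value in D800..DFFF -- a code point, but not a
        // Unicode scalar value.
        long lone = "x\uD800y".codePoints()
                .filter(cp -> cp >= 0xD800 && cp <= 0xDFFF)
                .count();
        System.out.println(lone); // 1
    }
}
```

Treating that stray D800 as a "permanently unassigned" character is exactly the pragmatic stance described above.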


> Best,
>
> Daniel
>
>
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Mark


On Sun, Sep 9, 2018 at 10:03 AM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sat, 8 Sep 2018 18:36:00 +0200
> Mark Davis ☕️ via Unicode  wrote:
>
> > I recently did some extensive revisions of a paper on Unicode string
> > models (APIs). Comments are welcome.
> >
> >
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
>
> Theoretically at least, the cost of indexing a big string by codepoint
> is negligible.  For example, cost of accessing the middle character is
> O(1)*, not O(n), where n is the length of the string.  The trick is to
> use a proportionately small amount of memory to store and maintain a
> partial conversion table from character index to byte index.  For
> example, Emacs claims to offer O(1) access to a UTF-8 buffer by
> character number, and I can't significantly fault the claim.
>
> *There may be some creep, but it doesn't matter for strings that can be
> stored within a galaxy.
>
> Of course, the coefficients implied by big-oh notation also matter.
> For example, it can be very easy to forget that a bubble sort is often
> the quickest sorting algorithm.
>

Thanks, added a quote from you on that; see if it looks ok.
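The trick Richard describes -- a proportionately small side table mapping every K-th code point to its byte offset -- can be sketched over a UTF-8 byte array. This is my own minimal illustration (the class, the STRIDE constant, and the fixed-stride design are assumptions, and it presumes the bytes are valid UTF-8):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Index {
    static final int STRIDE = 64; // one byte-offset entry per 64 code points

    final byte[] utf8;
    final int[] offsets; // offsets[i] = byte offset of code point i*STRIDE

    Utf8Index(String s) {
        utf8 = s.getBytes(StandardCharsets.UTF_8);
        int count = (int) s.codePoints().count();
        offsets = new int[count / STRIDE + 1];
        int cp = 0, idx = 0;
        for (int b = 0; b < utf8.length; ) {
            if (cp % STRIDE == 0) offsets[idx++] = b;
            cp++;
            b += seqLen(utf8[b]);
        }
    }

    // Sequence length from the lead byte; assumes valid UTF-8.
    static int seqLen(byte lead) {
        int v = lead & 0xFF;
        if (v < 0x80) return 1;
        if (v < 0xE0) return 2;
        if (v < 0xF0) return 3;
        return 4;
    }

    // Byte offset of the n-th code point: at most STRIDE steps from the
    // nearest table entry, i.e. O(1) for a fixed stride.
    int byteOffset(int n) {
        int b = offsets[n / STRIDE];
        for (int i = n % STRIDE; i > 0; i--) b += seqLen(utf8[b]);
        return b;
    }

    public static void main(String[] args) {
        // Each repeat is 4 code points taking 1+2+3+4 = 10 bytes.
        Utf8Index ix = new Utf8Index("a\u00E9\u20AC\uD83D\uDE00".repeat(40));
        System.out.println(ix.byteOffset(4));   // 10
        System.out.println(ix.byteOffset(100)); // 250
    }
}
```

The space overhead is one int per STRIDE code points, which is the "proportionately small amount of memory" in the quote; a mutable buffer (as in Emacs) additionally has to maintain the table across edits.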


> You keep muttering that a a sequence of 8-bit code units can contain
> invalid sequences, but often forget that that is also true of sequences
> of 16-bit code units.  Do emoji now ensure that confusion between
> codepoints and code units rapidly comes to light?
>

I didn't neglect that, had a [TBD] for it.

While UTF-16's invalid unpaired surrogates don't complicate processing much
if they are treated as unassigned characters, allowing invalid UTF-8
sequences is more troublesome. See, for example, the convolutions needed in
ICU methods that allow ill-formed UTF-8.


> You seem to keep forgetting that grapheme clusters are not how some
> people work.  Does the English word 'café' contain the letter
> 'e'?  Yes or no?  I maintain that it does.  I can't help thinking that
> one might want to look for the letter 'ă' in Vietnamese and find it
> whatever the associated tone mark is.
>

I'm pretty familiar with the situation, thanks for asking.

Often you want to find out more about the components of grapheme clusters,
so you always need to be able to iterate through the code points it
contains. One might think that iterating by grapheme cluster is hiding
features of the text. For example, with *fox́* (fox\u{301}) it is easy to
find that the text contains an *x* by iterating through code points. But
code points often don't reveal their components: does the word
*también* contain
the letter *e*? A reasonable question, but iterating by code point rather
than grapheme cluster doesn't help, since it is typically encoded as a
single U+00E9. And even decomposing to NFD doesn't always help, as with
cases like *rødgrød*.
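Java's built-in java.text.Normalizer makes both halves of this point concrete: NFD exposes the *e* in *también*, but does nothing for *rødgrød*, because ø (U+00F8) has no canonical decomposition:

```java
import java.text.Normalizer;

public class FindLetter {
    public static void main(String[] args) {
        // Iterating by code point alone won't find an 'e' here:
        String word = "tambi\u00E9n";           // precomposed é, U+00E9
        System.out.println(word.indexOf('e'));  // -1

        // Decomposing to NFD (e + U+0301) exposes it:
        String nfd = Normalizer.normalize(word, Normalizer.Form.NFD);
        System.out.println(nfd.indexOf('e'));   // 5

        // ...but NFD doesn't always help: ø has no canonical
        // decomposition, so there is still no plain 'o' to find.
        String rg = Normalizer.normalize("r\u00F8dgr\u00F8d",
                Normalizer.Form.NFD);
        System.out.println(rg.indexOf('o'));    // -1
    }
}
```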

>
> You didn't discuss substrings.


I did. But if you mean a definition of substring that lets you access
internal components of substrings, I'm afraid that is quite a specialized
usage. One could do it, but it would weigh down the general use case.

> I'm interested in how subsequences of
> strings are defined, as the concept of 'substring' isn't really Unicode
> compliant.  Again, expressing 'ă' as a subsequence of the Vietnamese
> word 'nặng' ought to be possible, whether one is using NFD (easier) or
> NFC.  (And there are alternative normalisations that are compatible
> with canonical equivalence.)  I'm most interested in subsequences X of a
> word W where W is the same as AXB for some strings A and B.


> Richard.
>
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Thanks, added a quote from you on that; see if it looks ok.

Mark


On Sat, Sep 8, 2018 at 9:20 PM John Cowan  wrote:

> This paper makes the default assumption that the internal storage of a
> string is a featureless array.  If this assumption is abandoned, it is
> possible to get O(1) indexes with fairly low space overhead.  The Scheme
> language has recently adopted immutable strings called "texts" as a
> supplement to its pre-existing mutable strings, and the sample
> implementation for this feature uses a vector of either native strings or
> bytevectors (char[] vectors in C/Java terms).  I would urge anyone
> interested in the question of storing and accessing mutable strings to read
> the following parts of SRFI 135 at <
> https://srfi.schemers.org/srfi-135/srfi-135.html>:  Abstract, Rationale,
> Specification / Basic concepts, and Implementation.  In addition, the
> design notes at <https://github.com/larcenists/larceny/wiki/ImmutableTexts>,
> though not up to date (in particular, UTF-16 internals are now allowed as
> an alternative to UTF-8), are of interest: unfortunately, the link to the
> span API has rotted.
>
> On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ☕️ via Unicore <
> unic...@unicode.org> wrote:
>
>> I recently did some extensive revisions of a paper on Unicode string
>> models (APIs). Comments are welcome.
>>
>>
>> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>>
>> Mark
>>
>


Re: Unicode String Models

2018-10-02 Thread Mark Davis ☕️ via Unicode
Thanks to all for comments. Just revised the text in https://goo.gl/neguxb.

Mark


On Sat, Sep 8, 2018 at 6:36 PM Mark Davis ☕️  wrote:

> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
>
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>
> Mark
>


Re: Unicode String Models

2018-09-11 Thread Mark Davis ☕️ via Unicode
These are all interesting and useful comments. I'll be responding once I
get a bit of free time, probably Friday or Saturday.

Mark


On Tue, Sep 11, 2018 at 4:16 AM Eli Zaretskii via Unicode <
unicode@unicode.org> wrote:

> > Date: Tue, 11 Sep 2018 13:12:40 +0300
> > From: Henri Sivonen via Unicode 
> >
> >  * I suggest splitting the "UTF-8 model" into three substantially
> > different models:
> >
> >  1) The UTF-8 Garbage In, Garbage Out model (the model of Go): No
> > UTF-8-related operations are performed when ingesting byte-oriented
> > data. Byte buffers and text buffers are type-wise ambiguous. Only
> > iterating over byte data by code point gives the data the UTF-8
> > interpretation. Unless the data is cleaned up as a side effect of such
> > iteration, malformed sequences in input survive into output.
> >
> >  2) UTF-8 without full trust in ability to retain validity (the model
> > of the UTF-8-using C++ parts of Gecko; I believe this to be the most
> > common UTF-8 model for C and C++, but I don't have evidence to back
> > this up): When data is ingested with text semantics, it is converted
> > to UTF-8. For data that's supposed to already be in UTF-8, this means
> > replacing malformed sequences with the REPLACEMENT CHARACTER, so the
> > data is valid UTF-8 right after input. However, iteration by code
> > point doesn't trust ability of other code to retain UTF-8 validity
> > perfectly and has "else" branches in order not to blow up if invalid
> > UTF-8 creeps into the system.
> >
> >  3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
> > have a different type in the type system than byte buffers. To go from
> > a byte buffer to an UTF-8 buffer, UTF-8 validity is checked. Once data
> > has been tagged as valid UTF-8, the validity is trusted completely so
> > that iteration by code point does not have "else" branches for
> > malformed sequences. If data that the type system indicates to be
> > valid UTF-8 wasn't actually valid, it would be nasal demon time. The
> > language has a default "safe" side and an opt-in "unsafe" side. The
> > unsafe side is for performing low-level operations in a way where the
> > responsibility of upholding invariants is moved from the compiler to
> > the programmer. It's impossible to violate the UTF-8 validity
> > invariant using the safe part of the language.
>
> There's another model, the one used by Emacs.  AFAIU, it is different
> from all the 3 you describe above.  In Emacs, each raw byte belonging
> to a byte sequence which is invalid under UTF-8 is represented as a
> special multibyte sequence.  IOW, Emacs's internal representation
> extends UTF-8 with multibyte sequences it uses to represent raw bytes.
> This allows mixing stray bytes and valid text in the same buffer,
> without risking lossy conversions (such as those one gets under model
> 2 above).
>
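The mechanism Eli describes can be sketched with the analogous trick Python calls "surrogateescape": each stray byte is smuggled through decoding as a reserved lone surrogate, so stray bytes and valid text coexist and the original bytes round-trip losslessly. This is my own illustration of the mechanism, not Emacs's actual internal representation (Emacs reserves dedicated multibyte sequences instead), and the validator is deliberately simplified (it does not reject overlongs or encoded surrogates):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RawByteRoundTrip {
    // Length of a plausibly valid UTF-8 sequence starting at i, or 0 if
    // invalid. Simplified: a real validator needs the tighter checks
    // from the Unicode core spec (overlongs, E0/ED ranges, etc.).
    static int validSeqLen(byte[] b, int i) {
        int v = b[i] & 0xFF, n;
        if (v < 0x80) return 1;
        else if (v >= 0xC2 && v <= 0xDF) n = 2;
        else if (v >= 0xE0 && v <= 0xEF) n = 3;
        else if (v >= 0xF0 && v <= 0xF4) n = 4;
        else return 0;
        if (i + n > b.length) return 0;
        for (int k = 1; k < n; k++)
            if ((b[i + k] & 0xC0) != 0x80) return 0;
        return n;
    }

    // Decode, escaping each stray byte as a lone surrogate U+DC80..U+DCFF.
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length; ) {
            int n = validSeqLen(in, i);
            if (n == 0) out.append((char) (0xDC00 | (in[i++] & 0xFF)));
            else {
                out.append(new String(in, i, n, StandardCharsets.UTF_8));
                i += n;
            }
        }
        return out.toString();
    }

    // Encode, turning escaped surrogates back into their raw bytes.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp >= 0xDC80 && cp <= 0xDCFF) out.write(cp & 0xFF);
            else {
                byte[] u = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                out.write(u, 0, u.length);
            }
            i += Character.charCount(cp);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] in = { 'a', (byte) 0xFF, (byte) 0xC3, (byte) 0xA9 }; // 'a', stray 0xFF, 'é'
        System.out.println(Arrays.equals(in, encode(decode(in)))); // true
    }
}
```

The cost, as with any such scheme, is that the decoded text is not valid UTF-16 until the escapes are removed.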


Re: Unicode String Models

2018-09-09 Thread Mark Davis ☕️ via Unicode
Thanks, excellent comments. While it is clear that some string models have
more complicated structures (with their own pros and cons), my focus was on
simple internal structures. The focus was also on immutable strings — and
the tradeoffs for mutable ones can be quite different — and that needs to
be clearer. I'll add some material about those two areas (with pointers to
sources where possible).

Mark


On Sat, Sep 8, 2018 at 9:20 PM John Cowan  wrote:

> This paper makes the default assumption that the internal storage of a
> string is a featureless array.  If this assumption is abandoned, it is
> possible to get O(1) indexes with fairly low space overhead.  The Scheme
> language has recently adopted immutable strings called "texts" as a
> supplement to its pre-existing mutable strings, and the sample
> implementation for this feature uses a vector of either native strings or
> bytevectors (char[] vectors in C/Java terms).  I would urge anyone
> interested in the question of storing and accessing mutable strings to read
> the following parts of SRFI 135 at <
> https://srfi.schemers.org/srfi-135/srfi-135.html>:  Abstract, Rationale,
> Specification / Basic concepts, and Implementation.  In addition, the
> design notes at <https://github.com/larcenists/larceny/wiki/ImmutableTexts>,
> though not up to date (in particular, UTF-16 internals are now allowed as
> an alternative to UTF-8), are of interest: unfortunately, the link to the
> span API has rotted.
>
> On Sat, Sep 8, 2018 at 12:53 PM Mark Davis ☕️ via Unicore <
> unic...@unicode.org> wrote:
>
>> I recently did some extensive revisions of a paper on Unicode string
>> models (APIs). Comments are welcome.
>>
>>
>> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#
>>
>> Mark
>>
>


Unicode String Models

2018-09-08 Thread Mark Davis ☕️ via Unicode
I recently did some extensive revisions of a paper on Unicode string models
(APIs). Comments are welcome.

https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

Mark


Re: Private Use areas (was: Re: Thoughts on working with the Emoji Subcommittee (was ...))

2018-08-20 Thread Mark Davis ☕️ via Unicode
> ... some people who would call a PUA solution either batty
> or crazy.

I don't think it is either batty or crazy. People can certainly use the PUA
to interchange text (assuming that they have downloaded fonts and keyboards
or some other input method beforehand), and it can definitely serve as a
proof of concept. Plain symbols — with no interactions between them (like
changing shape
with complex scripts), no combining/non-spacing marks, no case mappings,
and so on — are the best possible case for PUA.

The only caution I would give is that people shouldn't expect general
purpose software to do anything with PUA text that depends on character
properties.

Mark


On Mon, Aug 20, 2018 at 8:52 PM Doug Ewell via Unicode 
wrote:

> James Kass wrote:
>
> > As a caveat, some Unicode cognoscenti express disdain for the PUA, so
> > there would be some people who would call a PUA solution either batty
> > or crazy.
>
> I'm concerned that the constant "health warnings" about avoiding the PUA
> may have scared everyone away from this primary use case.
>
> Yes, you run the risk of someone else's PUA implementation colliding
> with yours. That's why you create a Private Use Agreement, and make sure
> it's prominently available to people who want to use your solution. It's
> not like there are hundreds of PUA schemes anyway.
>
> Yes, you will have to convert any existing data if the solution ever
> gets encoded in Unicode. That happened for Deseret and Shavian, and
> maybe others, and the sky didn't fall.
>
> People forget that it was the PUA in Shift-JIS, by Japanese mobile
> providers, that provided the platform for emoji to take off to such an
> extent that... well, we know the rest. If private-use is good enough for
> a legacy encoding, it ought to be good enough for Unicode.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
>


Re: Tales from the Archives

2018-08-19 Thread Mark Davis ☕️ via Unicode
You and Alan both raise good issues and make good points. I'd mention a
couple of others.

When we started Unicode, there were not a lot of alternatives to a
general-purpose discussion email list for internationalization, but now
there are many. Often the technical discussions are moved to more specific
forums. There are interesting discussions on the identification of Unicode
spoofing (because of look-alikes) on a variety of forums dealing with
security, for example. I suspect many of the font rendering issues have
widespread solutions now (as Alan notes) and that discussions of remaining
issues have shifted to forums on OpenType. There are some very intense
discussions of Mongolian model issues, but those also tend to be handled in
different venues. Work on ICU / CLDR also tend to take place in many cases
in the comments on particular tickets, rather than in email lists.

The work of the consortium has also broadened significantly beyond encoding
and issues closely related to encoding. Here's a slide to illustrate that.
(The first 24 slides in the deck are to give people some context and
perspective on what the Unicode Consortium does before focusing on a
narrower issue.)

https://docs.google.com/presentation/d/1QAyfwAn_0SZJ1yd0WiQgoJdG7djzDiq2Isb254ymDZc/edit#slide=id.g38b1fcd632_0_166

Mark


On Sun, Aug 19, 2018 at 5:06 PM Alan Wood via Unicode 
wrote:

> James
>
> I think you have answered your own question: nearly everything works
> "out-of-the-box".
>
> Unicode is just there, and most computer users have probably never heard
> of it.  I routinely produce web pages with English, French, Russian and
> Chinese text and a few symbols, and don't even think whether other people
> can see everything displayed properly.
>
> Long ago, the response to the question "Why can't I see character x" was
> often to install a copy of the Code2000 font and send the fee ($10 ?) to
> James Kass by airmail.
>
> These days, Windows 10 can display all of the major living languages (and
> I expect Macs can too, but I can't afford one now that I have retired).
>
> Some of the frequent posters have probably passed away, while others (like
> me) have got older, and slowed down and/or developed new interests.
>
> Best regards
>
> Alan Wood
> http://www.alanwood.net (Unicode, special characters, pesticide names)
>
>
> On Sunday, 19 August 2018, 03:05:41 GMT+1, James Kass via Unicode <
> unicode@unicode.org> wrote:
>
>
> http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML024/0180.html
>
> Back in 2000, William Overington asked about ligation for Latin and
> mentioned something about preserving older texts digitally.  John
> Cowan replied with some information about ZWJ/ZWNJ and I offered a
> link to a Unicode-based font, Junicode, which had (at that time)
> coverage for archaic letters already encoded, and which used the PUA
> for unencoded ligatures.
>
> At that time, OpenType support was primitive and not generally
> available.  If I'm not mistaken, the word "ligation" for typographic
> ligature forming had not yet been coined. IIRC John Hudson borrowed
> the medical word some time after that particular Unicode e-mail
> thread.  (One poster in that thread called it "ligaturing".)
>
> Peter Constable replied and explained clearly how ligation was
> expected to work for Latin in Unicode.  John Cowan posted again and
> augmented the information which Peter Constable had provided.  The
> information from Peter and John was instructional and helpful and
> furthered the education of at least one neophyte.
>
> Back then, display issues were on everyone's mind.  Many questions
> about display issues were posted to this list.  Unicode provided some
> novel methods of encoding complex scripts, such as for Indic, but
> those methods didn't actually work anywhere in the real world, so
> users stuck to the "ASCII-hack" fonts that actually did work.
>
> When questions about display issues and other technical aspects of
> Unicode were posted, experts from everywhere quickly responded with
> helpful pointers and explanations.
>
> Eighteen years pass, display issues have mostly gone away, nearly
> everything works "out-of-the-box", and list traffic has dropped
> dramatically.  Today's questions are usually either highly technical
> or emoji-related.
>
> Recent threads related to emoji included some questions and issues
> which remain unanswered in spite of the fact that there are list
> members who know the answers.
>
> This gives the impression that the Unicode public list has become
> passé.  That's almost as sad as looking down the archive posts, seeing
> the names of the posters, and remembering colleagues who no longer
> post.
>
> So I'm wondering what changed, but I don't expect an answer.
>
>


Re: Usage of emoji in coding contexts!

2018-08-09 Thread Mark Davis ☕️ via Unicode
Very amusing. But interesting how it catches your eye when scanning a list.

Mark

On Thu, Aug 9, 2018 at 7:37 AM, Shriramana Sharma via Unicode <
unicode@unicode.org> wrote:

> First time I'm seeing this (maybe others have seen this already):
>
> https://github.com/wei/pull
>
> Emoji being used in commit messages for classifying the nature of the
> commit – bug fixes, feature additions etc
>
> Now *that*'s a nice creative usage of emoji IMO…
>
> I see they haven't used them always as the actual emoji characters but
> sometimes as :coloned-tags: (or what do you call it) but I presume the
> GitHub system will convert it to the actual characters before
> displaying…
>
> --
> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा ူ၆ိျိါအူိ၆ါး
>
>


Re: Diacritic marks in parentheses

2018-07-26 Thread Mark Davis ☕️ via Unicode
But Asmus, think of how easy it would be to read:

  Ein⁽ᵉ⁾ A⁽¨⁾rzt⁽ⁱⁿ⁾ hat eine⁽ⁿ⁾ Studenti⁽ᵉ⁾n gesehen.

Mark

On Thu, Jul 26, 2018 at 2:15 PM, Mark Davis ☕️  wrote:

> 藍
>
> Mark
>
> On Thu, Jul 26, 2018 at 1:57 PM, Asmus Freytag via Unicode <
> unicode@unicode.org> wrote:
>
>> On 7/26/2018 9:27 AM, Markus Scherer via Unicode wrote:
>>
>> I would not expect for Ä+combining () above = Ä᪻ to look right except
>> with specialized fonts.
>> http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB==0
>>
>> Even if it worked widely, I think it would be confusing.
>> I think you are best off writing Arzt/Ärztin.
>>
>>
>> Why do something simple and unambiguous, when you can do something that's
>> technologically complex, looks unfamiliar to readers and is likely to be
>> misunderstood?
>>
>> :)
>>
>> A./
>>
>>
>


Re: Diacritic marks in parentheses

2018-07-26 Thread Mark Davis ☕️ via Unicode
藍

Mark

On Thu, Jul 26, 2018 at 1:57 PM, Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 7/26/2018 9:27 AM, Markus Scherer via Unicode wrote:
>
> I would not expect for Ä+combining () above = Ä᪻ to look right except with
> specialized fonts.
> http://demo.icu-project.org/icu-bin/nbrowser?t=%C3%84%5Cu1ABB==0
>
> Even if it worked widely, I think it would be confusing.
> I think you are best off writing Arzt/Ärztin.
>
>
> Why do something simple and unambiguous, when you can do something that's
> technologically complex, looks unfamiliar to readers and is likely to be
> misunderstood?
>
> :)
>
> A./
>
>


Re: Missing UAX#31 tests?

2018-07-14 Thread Mark Davis ☕️ via Unicode
Not to worry, these things happen to the best of us. Just glad the root of
the problem was found.

Mark

On Sat, Jul 14, 2018 at 5:51 PM, Karl Williamson 
wrote:

> On 07/09/2018 02:11 PM, Karl Williamson via Unicode wrote:
>
>> On 07/08/2018 03:21 AM, Mark Davis ☕️ wrote:
>>
>>> I'm surprised that the tests for 11.0 passed for a 10.0 implementation,
>>> because the following should have triggered a difference for WB. Can you
>>> check on this particular case?
>>>
>>> ÷ 0020 × 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷
>>> [0.3]
>>>
>>
>> I'm one of the people who advocated for this change, and I had already
>> tailored our implementation of 10.0 to not break between horizontal white
>> space, so it's actually not surprising that this rule didn't break
>>
>>>
>>>
> It turns out that the fault was all mine; the Unicode 11.0 tests were
> failing on a 10.0 implementation.  I'm sorry for starting this red herring
> thread.
>
> If you care to know the details, read on.
>
> The code that runs the tests knows what version of the UCD it is using,
> and it knows what version of the UAX boundary algorithms it is using. If
> these differ, it emits a warning about the discrepancy, and expects that
> there are going to be many test failures, so it marks all failing ones as
> 'To do' which suppresses their output, so as to not distract from any other
> failures that have been introduced by using the new UCD version.  (Updating
> the algorithm comes last.)
>
> The solution for the future is to change the warning about the discrepancy
> to note that the failing boundary algorithm tests are suppressed.  This
> will clue me (or whoever) in that all is not necessarily well.
>
>
>
>>> About the testing:
>>>
>>> The tests are generated so that they go all the combinations of pairs,
>>> and some combinations of triples. The generated test cases use a sample
>>> from each partition of characters, to cut down on the file size to a
>>> reasonable level. That also means that some changes in the rules don't
>>> cause changes in the test results. Because it is not possible to test every
>>> combination, so there is also provision for additional test cases, such as
>>> those at the end of the files, eg:
>>>
>>> https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html
>>> https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html
>>>
>>> We should extend those each time to make sure we cover combinations that
>>> aren't covered by pairs. There were some additions to that end; if they
>>> didn't cover enough cases, then we can look at your experience to add more.
>>>
>>> I can suggest two strategies for further testing:
>>>
>>> 1. To do a full test, for each row check every combination obtained by
>>> replacing each sample character by every other character in its
>>> partition. Eg for the above line that would mean testing every
>>> <WSegSpace, WSegSpace> sequence.
>>>
>>> 2. Use a monkey test against ICU. That is, generate random combinations
>>> of characters from different partitions and check that ICU and your
>>> implementation are in sync.
>>>
>>> 3. During the beta period, test your previous-version with the new test
>>> files. If there are no failures, yet there are changes in the rules, then
>>> raise that issue during the beta period so we can add tests.
>>>
>>
>> I actually did this, and as I recall, did find some test failures.  In
>> retrospect, I must have screwed up somehow back then.  I was under tight
>> deadline pressure, and as a result, did more cursory beta testing than
>> normal.
>>
>>>
>>> 4. If possible, during the beta period upgrade your implementation and
>>> test against the new and old test files.
>>>
>>
>>
>>> Anyone else have other suggestions for testing?
>>>
>>> Mark
>>>
>>>
>> As an aside, a release or two ago, I implemented SB, and someone
>> immediately found a bug, and accused me of releasing software that had not
>> been tested at all.  He had looked through the test suite and not found
>> anything that looked like it was testing that.  But he failed to find the
>> test file which bundled up all your tests, in a manner he was not
>> accustomed to, so it was easy for him to overlook.  The bug only manifested
>> itself in longer runs of characters than your pairs and triples tested.  I
>> looked at it, and your SB tests still seemed reasonable, and I should not
>> expect a more complete series than you furnished.

Re: Handling emoji

2018-07-14 Thread Mark Davis ☕️ via Unicode
Just fixed the one you found, Philippe...

Mark

On Sat, Jul 14, 2018 at 2:51 PM, Mark Davis ☕️  wrote:

> Thanks for the feedback, Philippe.
>
> I haven't fixed that one yet, but added some more text (thanks to Ben
> Hamilton!) and an acknowledgments section.
>
>
>
> Mark
>
> On Sat, Jul 14, 2018 at 12:06 PM, Philippe Verdy 
> wrote:
>
>> Hello Mark,
>>
>> In your document
>> (https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview),
>> the last
>> code segment has bugs:
>>
>>
>>
>>
>>     ULocale danishLocale = ULocale.forLanguageTag("da");
>>     Collator danishAndEmoji = new RuleBasedCollator(
>>         ((RuleBasedCollator) Collator.getInstance(locale1)).getRules()
>>             + ((RuleBasedCollator) Collator.getInstance(locale2)).getRules());
>>
>> where locale1 and locale2 are undefined. I suppose they are danishLocale,
>> defined here, and emojiLocale defined previously as:
>>
>>     ULocale emojiLocale = ULocale.forLanguageTag("und-u-co-emoji");
>>
>> But I'm not sure of their order (which one of the two defined (named)
>> locales is locale1 or locale2).
>>
>> Philippe.
>>
>> 2018-07-13 20:33 GMT+02:00 Mark Davis ☕️ via Unicode > >:
>>
>>> Put together a doc about this; suggestions for improvement are welcome.
>>>
>>> https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview
>>>
>>> Mark
>>>
>>
>>
>


Re: Handling emoji

2018-07-14 Thread Mark Davis ☕️ via Unicode
Thanks for the feedback, Philippe.

I haven't fixed that one yet, but added some more text (thanks to Ben
Hamilton!) and an acknowledgments section.



Mark

On Sat, Jul 14, 2018 at 12:06 PM, Philippe Verdy  wrote:

> Hello Mark,
>
> In your document
> (https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview),
> the last
> code segment has bugs:
>
>
>
>
>     ULocale danishLocale = ULocale.forLanguageTag("da");
>     Collator danishAndEmoji = new RuleBasedCollator(
>         ((RuleBasedCollator) Collator.getInstance(locale1)).getRules()
>             + ((RuleBasedCollator) Collator.getInstance(locale2)).getRules());
>
> where locale1 and locale2 are undefined. I suppose they are danishLocale,
> defined here, and emojiLocale defined previously as:
>
>     ULocale emojiLocale = ULocale.forLanguageTag("und-u-co-emoji");
>
> But I'm not sure of their order (which one of the two defined (named)
> locales is locale1 or locale2).
>
> Philippe.
>
> 2018-07-13 20:33 GMT+02:00 Mark Davis ☕️ via Unicode:
>
>> Put together a doc about this; suggestions for improvement are welcome.
>>
>> https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview
>>
>> Mark
>>
>
>


Handling emoji

2018-07-13 Thread Mark Davis ☕️ via Unicode
Put together a doc about this; suggestions for improvement are welcome.

https://docs.google.com/document/d/1pC7N32TnmDr2xzFW4HscA1DyAPPZnwILUH2_03UL6Jo/preview

Mark


Re: Missing UAX#31 tests?

2018-07-09 Thread Mark Davis ☕️ via Unicode
Thanks, Karl.

Mark

On Mon, Jul 9, 2018 at 10:11 PM, Karl Williamson 
wrote:

> On 07/08/2018 03:21 AM, Mark Davis ☕️ wrote:
>
>> I'm surprised that the tests for 11.0 passed for a 10.0 implementation,
>> because the following should have triggered a difference for WB. Can you
>> check on this particular case?
>>
>> ÷ 0020 × 0020 ÷ # ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷
>> [0.3]
>>
>
> I'm one of the people who advocated for this change, and I had already
> tailored our implementation of 10.0 to not break between horizontal white
> space, so it's actually not surprising that this rule didn't break
>
>>
>>
>> About the testing:
>>
>> The tests are generated so that they go all the combinations of pairs,
>> and some combinations of triples. The generated test cases use a sample
>> from each partition of characters, to cut down on the file size to a
>> reasonable level. That also means that some changes in the rules don't
>> cause changes in the test results. Because it is not possible to test every
>> combination, so there is also provision for additional test cases, such as
>> those at the end of the files, eg:
>>
>> https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html
>> https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html
>>
>> We should extend those each time to make sure we cover combinations that
>> aren't covered by pairs. There were some additions to that end; if they
>> didn't cover enough cases, then we can look at your experience to add more.
>>
>> I can suggest two strategies for further testing:
>>
>> 1. To do a full test, for each row check every combination obtained by
>> replacing each sample character by every other character in its
>> partition. Eg for the above line that would mean testing every
>> <WSegSpace, WSegSpace> sequence.
>>
>> 2. Use a monkey test against ICU. That is, generate random combinations
>> of characters from different partitions and check that ICU and your
>> implementation are in sync.
>>
>> 3. During the beta period, test your previous version with the new test
>> files. If there are no failures, yet there are changes in the rules, then
>> raise that issue during the beta period so we can add tests.
>>
>
> I actually did this, and as I recall, did find some test failures.  In
> retrospect, I must have screwed up somehow back then.  I was under tight
> deadline pressure, and as a result, did more cursory beta testing than
> normal.
>
>>
>> 4. If possible, during the beta period upgrade your implementation and
>> test against the new and old test files.
>>
>
>
>> Anyone else have other suggestions for testing?
>>
>> Mark
>>
>>
> As an aside, a release or two ago, I implemented SB, and someone
> immediately found a bug, and accused me of releasing software that had not
> been tested at all.  He had looked through the test suite and not found
> anything that looked like it was testing that.  But he failed to find the
> test file, which bundled up all your tests in a manner he was not
> accustomed to, so it was easy for him to overlook.  The bug only manifested
> itself in longer runs of characters than your pairs and triples tested.  I
> looked at it, and your SB tests still seemed reasonable; I should not have
> expected a more complete series than you furnished.
>
>>
>>
>> Mark
>>
>> On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode <
>> unicode@unicode.org <mailto:unicode@unicode.org>> wrote:
>>
>> I am working on upgrading from Unicode 10 to Unicode 11.
>>
>> I used all the new files.
>>
>> The algorithms for some of the boundaries, like GCB and WB, have
>> changed so that some of the property values no longer have code
>> points associated with them.
>>
>> I ran the tests furnished in 11.0 for these boundaries, without
>> having changed the algorithms from earlier releases.  All passed 100%.
>>
>> Unless I'm missing something, that indicates that the tests
>> furnished in 11.0 do not contain instances that exercise these
>> changes.  My guess is that the 10.0 tests were also deficient.
>>
>> I have been relying on the UCD to furnish tests that have enough
>> coverage to sufficiently exercise the algorithms that are specified
>> in UAX 31, but that appears to have been naive on my part
>>
>>
>>
>


Re: Missing UAX#31 tests?

2018-07-08 Thread Mark Davis ☕️ via Unicode
PS, although the title was "Missing UAX#31 tests?", I assumed you were
talking about http://unicode.org/reports/tr29/

Mark

On Sun, Jul 8, 2018 at 11:21 AM, Mark Davis ☕️  wrote:

> I'm surprised that the tests for 11.0 passed for a 10.0 implementation,
> because the following should have triggered a difference for WB. Can you
> check on this particular case?
>
> ÷ 0020 × 0020 ÷ #  ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷
> [0.3]
>
> About the testing:
>
> The tests are generated so that they go through all the combinations of pairs, and
> some combinations of triples. The generated test cases use a sample from
> each partition of characters, to cut down on the file size to a reasonable
> level. That also means that some changes in the rules don't cause changes
> in the test results. Because it is not possible to test every
> combination, there is also provision for additional test cases, such as
> those at the end of the files, eg:
>
> https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html
> https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html
>
> We should extend those each time to make sure we cover combinations that
> aren't covered by pairs. There were some additions to that end; if they
> didn't cover enough cases, then we can look at your experience to add more.
>
> I can suggest a few strategies for further testing:
>
> 1. To do a full test, for each row check every combination obtained by
> replacing each sample character by every other character in its
> partition. Eg for the above line that would mean testing every
> <WSegSpace, WSegSpace> sequence.
>
> 2. Use a monkey test against ICU. That is, generate random combinations of
> characters from different partitions and check that ICU and your
> implementation are in sync.
>
> 3. During the beta period, test your previous version with the new test
> files. If there are no failures, yet there are changes in the rules, then
> raise that issue during the beta period so we can add tests.
>
> 4. If possible, during the beta period upgrade your implementation and
> test against the new and old test files.
>
> Anyone else have other suggestions for testing?
>
> Mark
>
>
> On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode <
> unicode@unicode.org> wrote:
>
>> I am working on upgrading from Unicode 10 to Unicode 11.
>>
>> I used all the new files.
>>
>> The algorithms for some of the boundaries, like GCB and WB, have changed
>> so that some of the property values no longer have code points associated
>> with them.
>>
>> I ran the tests furnished in 11.0 for these boundaries, without having
>> changed the algorithms from earlier releases.  All passed 100%.
>>
>> Unless I'm missing something, that indicates that the tests furnished in
>> 11.0 do not contain instances that exercise these changes.  My guess is
>> that the 10.0 tests were also deficient.
>>
>> I have been relying on the UCD to furnish tests that have enough coverage
>> to sufficiently exercise the algorithms that are specified in UAX 31, but
>> that appears to have been naive on my part
>>
>
>


Re: Missing UAX#31 tests?

2018-07-08 Thread Mark Davis ☕️ via Unicode
I'm surprised that the tests for 11.0 passed for a 10.0 implementation,
because the following should have triggered a difference for WB. Can you
check on this particular case?

÷ 0020 × 0020 ÷ #  ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷
[0.3]

About the testing:

The tests are generated so that they go through all the combinations of pairs, and
some combinations of triples. The generated test cases use a sample from
each partition of characters, to cut down on the file size to a reasonable
level. That also means that some changes in the rules don't cause changes
in the test results. Because it is not possible to test every combination,
there is also provision for additional test cases, such as those at the
end of the files, eg:

https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html
https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html

We should extend those each time to make sure we cover combinations that
aren't covered by pairs. There were some additions to that end; if they
didn't cover enough cases, then we can look at your experience to add more.

I can suggest a few strategies for further testing:

1. To do a full test, for each row check every combination obtained by
replacing each sample character by every other character in its
partition. Eg for the above line that would mean testing every
<WSegSpace, WSegSpace> sequence.

2. Use a monkey test against ICU. That is, generate random combinations of
characters from different partitions and check that ICU and your
implementation are in sync.

3. During the beta period, test your previous version with the new test
files. If there are no failures, yet there are changes in the rules, then
raise that issue during the beta period so we can add tests.

4. If possible, during the beta period upgrade your implementation and test
against the new and old test files.
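Strategy 1 above (expanding each generated test row across its partitions) can be sketched with toy data. The partition table below holds a couple of illustrative sample members only; a real harness would load the full sets from the UCD's WordBreakProperty.txt:

```python
import itertools

# Hypothetical partition data: each Word_Break property value mapped to a few
# of its member code points. A real test harness would populate these sets
# from WordBreakProperty.txt rather than hard-coding them.
PARTITIONS = {
    "WSegSpace": ["\u0020", "\u2000"],   # SPACE, EN QUAD (sample members only)
    "ALetter":   ["A", "b"],
}

def expand_row(property_sequence):
    """Strategy 1: replace each sample character in a generated test row by
    every member of its partition, yielding concrete test sequences."""
    pools = [PARTITIONS[p] for p in property_sequence]
    return ["".join(chars) for chars in itertools.product(*pools)]

# The <WSegSpace, WSegSpace> row from the test file expands to 2 x 2 cases.
cases = expand_row(["WSegSpace", "WSegSpace"])
```

Each expanded case would then be run through the segmenter and checked against the break positions the original row asserts.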

Anyone else have other suggestions for testing?

Mark

On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode <
unicode@unicode.org> wrote:

> I am working on upgrading from Unicode 10 to Unicode 11.
>
> I used all the new files.
>
> The algorithms for some of the boundaries, like GCB and WB, have changed
> so that some of the property values no longer have code points associated
> with them.
>
> I ran the tests furnished in 11.0 for these boundaries, without having
> changed the algorithms from earlier releases.  All passed 100%.
>
> Unless I'm missing something, that indicates that the tests furnished in
> 11.0 do not contain instances that exercise these changes.  My guess is
> that the 10.0 tests were also deficient.
>
> I have been relying on the UCD to furnish tests that have enough coverage
> to sufficiently exercise the algorithms that are specified in UAX 31, but
> that appears to have been naive on my part
>


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-13 Thread Mark Davis ☕️ via Unicode
> That is, why is conforming to UAX #31 worth the risk of prohibiting the
use of characters that some users might want to use?

One could parse for certain sequences, putting characters into a number of
broad categories. Very approximately:

   - junk ~= [[:cn:][:cs:][:co:]]+
   - whitespace ~= [[:z:][:c:]-junk]+
   - syntax ~= [[:s:][:p:]] // broadly speaking, including both the
   language syntax & user-named operators
   - identifiers ~= [all-else]+
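A rough rendering of these broad categories, using General_Category as a stand-in for the POSIX-style classes above. This is a sketch only; a real lexer would follow whatever profile it declares:

```python
import unicodedata
from itertools import groupby

def classify(ch):
    """Map a character into one of the broad categories sketched above,
    using its Unicode General_Category."""
    cat = unicodedata.category(ch)
    if cat in ("Cn", "Cs", "Co"):      # unassigned, surrogates, private use
        return "junk"
    if cat[0] in ("Z", "C"):           # separators plus remaining controls/format
        return "whitespace"
    if cat[0] in ("S", "P"):           # symbols and punctuation -> syntax
        return "syntax"
    return "identifier"                # all else

def tokenize(text):
    """Group consecutive characters of the same broad category into runs."""
    return [(cls, "".join(run)) for cls, run in groupby(text, key=classify)]
```

So `tokenize("ab.cd")` splits identifier runs at syntax characters without ever consulting an identifier whitelist.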

UAX #31 specifies several different kinds of identifiers, and takes roughly
that approach for
http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the
focus there is on immutability.

So an implementation could choose to follow that course, rather than the
more narrowly defined identifiers in
http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively,
one can conform to the Default Identifiers but declare a profile that
expands the allowable characters. One could take a Swiftian approach,
for example...

Mark

On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode <
unicode@unicode.org> wrote:

> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen 
> wrote:
> > Considering that ruling out too much can be a problem later, but just
> > treating anything above ASCII as opaque hasn't caused trouble (that I
> > know of) for HTML other than compatibility issues with XML's stricter
> > stance, why should a programming language, if it opts to support
> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> > complexity of UAX #31 instead of allowing everything above ASCII in
> > identifiers? In other words, what problem does making a programming
> > language conform to UAX #31 solve?
>
> After refreshing my memory of XML history, I realize that mentioning
> XML does not helpfully illustrate my question despite the mention of
> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
> ignore the XML part.
>
> Trying to rephrase my question more clearly:
>
> Let's assume that we are designing a computer-parseable syntax where
> tokens consisting of user-chosen characters can't occur next to each
> other and, instead, always have some syntax-reserved characters
> between them. That is, I'm talking about syntaxes that look like this
> (could be e.g. Java):
>
> ab.cd();
>
> Here, ab and cd are tokens with user-chosen characters whereas space
> (the indent),  period, parenthesis and the semicolon are
> syntax-reserved. We know that ab and cd are distinct tokens, because
> there is a period between them, and we know the opening parethesis
> ends the cd token.
>
> To illustrate what I'm explicitly _not_ talking about, I'm not talking
> about a syntax like this:
>
> αβ⊗γδ
>
> Here αβ and γδ are user-named variable names and ⊗ is a user-named
> operator and the distinction between different kinds of user-named
> tokens has to be known somehow in order to be able to tell that there
> are three distinct tokens: αβ, ⊗, and γδ.
>
> My question is:
>
> When designing a syntax where tokens with the user-chosen characters
> can't occur next to each other without some syntax-reserved characters
> between them, what advantages are there from limiting the user-chosen
> characters according to UAX #31 as opposed to treating any character
> that is not a syntax-reserved character as a character that can occur
> in user-named tokens?
>
> I understand that taking the latter approach allows users to mint
> tokens that on some aesthetic measure don't make sense (e.g. minting
> tokens that consist of glyphless code points), but why is it important
> to prescribe that this is prohibited as opposed to just letting users
> choose not to mint tokens that are inconvenient for them to work with
> given the behavior that their plain text editor gives to various
> characters? That is, why is conforming to UAX #31 worth the risk of
> prohibiting the use of characters that some users might want to use?
> The introduction of XID after ID and the introduction of Extended
> Hashtag Identifiers after XID is indicative of over-restriction having
> been a problem.
>
> Limiting user-minted tokens to UAX #31 does not appear to be necessary
> for security purposes considering that HTML and CSS exist in a
> particularly adversarial environment and get away with taking the
> approach that any character that isn't a syntax-reserved character is
> collected as part of a user-minted identifier. (Informally, both treat
> non-ASCII characters the same as an ASCII underscore. HTML even treats
> non-whitespace, non-U+ ASCII controls that way.)
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
>
>
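The "latter approach" Henri describes, where any non-reserved character is collected into a user-minted token, is simple to implement. A minimal sketch for a toy Java-like syntax; the reserved set here is hypothetical:

```python
# Hypothetical syntax-reserved characters for a toy Java-like syntax.
RESERVED = set(" \t\r\n.();,=")

def split_tokens(src):
    """Split source text into tokens: any run of non-reserved characters is a
    user-minted token; each non-whitespace reserved character stands alone."""
    tokens, run = [], []
    for ch in src:
        if ch in RESERVED:
            if run:
                tokens.append("".join(run))
                run = []
            if not ch.isspace():
                tokens.append(ch)
        else:
            run.append(ch)          # ASCII or not: no identifier whitelist
    if run:
        tokens.append("".join(run))
    return tokens
```

With this, `split_tokens("ab.cd();")` and `split_tokens("αβ.γδ();")` tokenize identically; the UAX #31 question is what the extra identifier restriction buys beyond this.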


Re: The Unicode Standard and ISO

2018-06-12 Thread Mark Davis ☕️ via Unicode
Steven wrote:

>  I usually recommend creating a new project first...

That is often a viable approach. But proponents shouldn't get the wrong
impression. I think anything resembling the "localized
sentences" / "international message components" has zero chance of being
adopted by Unicode (including the encoding, CLDR, anything). It is a waste
of many people's time discussing it further on this list.

Why? As discussed many times on this list, it would take a major effort, is
not scoped properly (the translation of messages depends highly on context,
including specific products), and would not meet the needs of practically
anyone.

People interested in this topic should
(a) start up their own project somewhere else,
(b) take discussion of it off this list,
(c) never bring it up again on this list.


Mark

On Tue, Jun 12, 2018 at 4:53 PM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

>
> William,
>
> On 12/06/18 12:26, William_J_G Overington wrote:
> >
> > Hi Marcel
> >
> > > I don’t fully disagree with Asmus, as I suggested to make available
> > > localizable (and effectively localized) libraries of message
> > > components, rather than of entire messages.
> >
> > Could you possibly give some examples of the message components to
> > which you refer please?
> >
>
> Likewise I’d be interested in asking Jonathan Rosenne for an example or
> two of automated translation from English to bidi languages with data
> embedded,
> as on Mon, 11 Jun 2018 15:42:38 +, Jonathan Rosenne via Unicode wrote:
> […]
> > > > One has to see it to believe what happens to messages translated
> > > > mechanically from English to bidi languages when data is embedded
> > > > in the text.
>
> But both would require launching a new thread.
>
> Thinking hard enough, I’m even afraid that most subscribers wouldn’t be
> interested, so we’d have to move off-list.
>
> One alternative I can think of is to use one of the CLDR mailing lists. I
> subscribed to CLDR-users when I was directed to move there some
> technical discussion about keyboard layouts from Unicode Public.
>
> But now as international message components are not yet a part of CLDR,
> we’d need to ask for extra permission to do so.
>
> An additional drawback of launching a technical discussion right now is
> that significant parts of CLDR data are not yet correctly localized, so
> there is another bunch of priorities under the July 11 deadline. I guess
> that vendors wouldn’t be glad to see us gathering data for new structures
> while level=Modern isn’t complete.
>
> In the meantime, you are welcome to contribute and to motivate missing
> people to do the same.
>
> Best regards,
>
> Marcel
>
>


Re: UTS#51 and emoji-sequences.txt

2018-06-09 Thread Mark Davis ☕️ via Unicode
Thanks, it definitely looks like there are some mismatches in terminology
there. Can you please file this with the reporting form on the unicode site?


On Sat, Jun 9, 2018, 05:00 Yifán Wáng via Unicode 
wrote:

> When I'm looking at
> https://unicode.org/Public/emoji/11.0/emoji-sequences.txt
>
> It goes on line 16 that:
> --
> #   type_field: any of {Emoji_Combining_Sequence, Emoji_Flag_Sequence,
> Emoji_Modifier_Sequence}
> # The type_field is a convenience for parsing the emoji sequence
> files, and is not intended to be maintained as a property.
> --
>
> This field, however, actually contains "Emoji_Keycap_Sequence" and
> "Emoji_Tag_Sequence", instead of "Emoji_Combining_Sequence" (it was
> already so in 5.0).
>
> And I go back to
> http://www.unicode.org/reports/tr51/
>
> Under the section 1.4.6:
> --
> ED-21. emoji keycap sequence set — The specific set of emoji sequences
> listed in the emoji-sequences.txt file [emoji-data] under the category
> Emoji_Keycap_Sequence.
> ED-22. emoji modifier sequence set — The specific set of emoji
> sequences listed in the emoji-sequences.txt file [emoji-data] under
> the category Emoji_Modifier_Sequence.
> ED-23. RGI emoji flag sequence set — The specific set of emoji
> sequences listed in the emoji-sequences.txt file [emoji-data] under
> the category Emoji_Flag_Sequence.
> ED-24. RGI emoji tag sequence set — The specific set of emoji
> sequences listed in the emoji-sequences.txt file [emoji-data] under
> the category Emoji_Tag_Sequence.
> --
>
> I'm not sure if the "category" means "type_field" or headings in the
> txt file, as the headings do not contain underscores. If it means
> "type_field", then the description of type_field above is wrong.
>
> Also the section 1.4.5:
> --
> ED-14c. emoji keycap sequence — A sequence of the following form:
>
> emoji_keycap_sequence := [0-9#*] \x{FE0F 20E3}
>
> - These characters are in the emoji-sequences.txt file listed under
> the category Emoji_Keycap_Sequence
> --
> While in the previous version (rev. 12):
> --
> ED-14c. emoji keycap sequence — An emoji combining sequence of the
> following form:
>
> emoji_keycap_sequence := [0-9#*] \x{FE0F 20E3}
>
> - These characters are in the emoji-sequences.txt file listed under
> the category Emoji_Combining_Keycap_Sequence
> --
>
> It seems there was some kind of confusion on terms, but anyway, isn't
> the last line of ED-14c redundant with the current revision? (Or
> "Emoji_Combining_Sequence" is intended?)
>
> Thank you.
>
> Wang Yifan
>
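The ED-14c production quoted above is mechanically checkable. A minimal sketch; the function name is ours, not from UTS #51:

```python
import re

# ED-14c: emoji_keycap_sequence := [0-9#*] \x{FE0F 20E3}
# i.e. a base character from [0-9#*], then U+FE0F VARIATION SELECTOR-16,
# then U+20E3 COMBINING ENCLOSING KEYCAP.
_KEYCAP = re.compile("[0-9#*]\uFE0F\u20E3")

def is_emoji_keycap_sequence(s):
    """True if s is exactly one emoji keycap sequence per ED-14c."""
    return _KEYCAP.fullmatch(s) is not None
```

For example, "1" + U+FE0F + U+20E3 matches, while "1" + U+20E3 alone (missing the variation selector) does not.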
>


Re: The Unicode Standard and ISO

2018-06-08 Thread Mark Davis ☕️ via Unicode
Mark

On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> Marcel Schneider via Unicode  wrote:
>
> > Thank you for confirming. All witnesses concur to invalidate the
> > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > After being invented in its actual form, sorting was standardized
> > simultaneously in ISO/IEC 14651 and in Unicode Collation Algorithm,
> > the latter including practice‐oriented extra features.
>
> The UCA contains features essential for respecting canonical
> equivalence.  ICU works hard to avoid the extra effort involved,
> apparently even going to the extreme of implicitly declaring that
> Vietnamese is not a human language.


A bit over the top, eh?


> (Some contractions are not
> supported by ICU!)


I'm guessing you mean https://unicode.org/cldr/trac/ticket/10868, which
nicely outlines a proposal for dealing with a number of problems with
Vietnamese.

We clearly don't support every sorting feature that various dictionaries
and agencies come up with. Sometimes it is because we can't (yet) see a
good way to do it:

   1. it might not be deterministic: many governmental standards or style
   sheets require "interesting" sorting, such as determining that "XI" is a
   roman numeral (not the president of China) and sorting as 11, or when "St."
   is meant to be Street *and* when meant to be Saint (St. Stephen's St.)
   2. the prospective cost in memory, code complexity, or performance, or
   the time necessary to figure out how to do complex requirements, doesn't
   seem to warrant adding it at this point. Now, if you or others are interested
   in proposing specific patches to address certain issues, then you can
   propose that. Best to make a proposal (ticket) before doing the work,
   because if the solution is very intricate, even the time necessary to
   evaluate the patch can be too much to fit into the schedule. For that
   reason, it is best to break up such tickets into small, tractable pieces.

> The synchronisation is manifest in the DUCET
> collation, which seems to make the effort to ensure that some canonical
> equivalent will sort the same way under ISO/IEC 14651.
>
> > Since then,
> > these two standards are kept in synchrony uninterruptedly.
>
> But the consortium has formally dropped the commitment to DUCET in
> CLDR.  Even when restricted to strings of assigned characters, the CLDR
> and ICU no longer make the effort to support the DUCET collation.
> Indeed, I'm not even sure that the DUCET is a tailoring of the root CLDR
> collation, even when restricted to assigned characters.  Tailorings
> tend to have odd side effects; fortunately, they rarely if ever matter.
> CLDR root is a rewrite with modifications of DUCET; it has changes that
> are prohibited as 'tailorings'!
>

CLDR does make some tailorings to the DUCET to create its root collation,
notably adding special contractions of private use characters to allow for
tailoring support and indexes [
http://unicode.org/reports/tr35/tr35-collation.html#File_Format_FractionalUCA_txt
]  plus the rearrangement of some characters (mostly punctuation and
symbols) to allow runtime parametric reordering of groups of characters (eg
to put numbers after letters) [
http://unicode.org/reports/tr35/tr35-collation.html#grouping_classes_of_characters
].

   - If there are other changes that are not well documented, or if you
   think those features are causing problems in some way, please file a
   ticket.
   - If there is a particular change that you think is not conformant to
   UCA, please also file that.


> Richard.
>
>


Re: The Unicode Standard and ISO

2018-06-08 Thread Mark Davis ☕️ via Unicode
Where are you getting your "facts"? Among many unsubstantiated or ambiguous
claims in that very long sentence:

   1. "French locale in CLDR is still surprisingly incomplete".
  1. For each release, the data collected for the French locale is
  complete to the bar we have set for Level=Modern.
  2. What you may mean is that CLDR doesn't support a structure that
   you think it should. For that, you have to make a compelling case that
   the structure you propose is worth it, worth diverting people from other
   priorities.
   2. French contributors are not "prevented from cooperating". Where do
   you get this from? Who do you mean?
   1. We have had many French speakers contribute data over time. Now, it
   works better when people engage under the umbrella of an organization,
   but even there that doesn't have to be a company; we have liaison
   relationships with government agencies and NGOs.
   3. There were not "many attempts" at a merger, and Unicode didn't
   "refuse" anything. Who do you think "attempted", and when?
   1. Albeit, given the state of ISO/IEC 15897, there was nothing such a
   merger would have contributed anyway.
   2. BTW, your use of the term "refuse" might be a language issue. I
   don't "refuse" to respond to the widow of a Nigerian Prince who wants to
   give me $1M. Since I don't think it is worth my time, or am not willing
   to pay the low, low fee of $10K upfront, I might "ignore" the email, or
   "not respond" to it. Or I might "decline" it with a no-thanks or
   not-interested response. But none of that is to "refuse" it.



Mark

On Fri, Jun 8, 2018 at 5:32 AM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On Thu, 7 Jun 2018 22:46:12 +0300, Erkki I. Kolehmainen via Unicode wrote:
> >
> > I cannot but fully agree with Mark and Michael.
> >
> > Sincerely
> >
>
> Thank you for confirming. All witnesses concur to invalidate the statement
> about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. — After being
> invented in its actual form, sorting was standardized simultaneously in
> ISO/IEC 14651 and in Unicode Collation Algorithm, the latter including
> practice‐oriented extra features.
> Since then, these two standards are kept in synchrony uninterruptedly.
>
> Getting people to correct the overall response was not really my initial
> concern, however. What bothered me before I learned that Unicode refuses
> to cooperate with ISO/IEC JTC1 SC22 is that the registration of the French
> locale in CLDR is still surprisingly incomplete despite the meritorious
> efforts made by the actual contributors, and then after some investigation,
> that the main part of the potential French contributors are prevented from
> cooperating because Unicode refuses to cooperate with ISO/IEC on locale
> data while ISO/IEC 15897 predates CLDR, reportedly after many attempts
> made to merge both standards, remaining unsuccessful without any striking
> exposure or friendly agreement to avoid kind of an impression of
> unconcerned rebuff.
>
> Best regards,
>
> Marcel
>
>


Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Mark Davis ☕️ via Unicode
Got it, thanks.

Mark

On Thu, Jun 7, 2018 at 3:29 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Thu, 7 Jun 2018 10:42:46 +0200
> Mark Davis ☕️ via Unicode  wrote:
>
> > > The proposal also asks for identifiers to be treated as equivalent
> > > under
> > NFKC.
> >
> > The guidance in #31 may not be clear. It is not to replace
> > identifiers as typed in by the user by their NFKC equivalent. It is
> > rather to internally *identify* two identifiers (as typed in by the
> > user) as being the same. For example, Pascal had case-insensitive
> > identifiers. That means someone could type in
> >
> > myIdentifier = 3;
> > MyIdentifier = 4;
> >
> And both of those would be references to the same internal entity. So
> cases like SARA AM don't necessarily play into this.
>
> There has been a suggestion to not just restrict identifiers to NFKC
> equivalence classes (UAX31-R4), but to actually restrict them to NFKC
> form (UAX31-R6).  That is where the issue with SARA AM changes from a
> lurking issue to an active problem.  Others have realised that NFC
> makes more sense than NFKC for Rust.
>
> Richard.
>
>
>


Re: The Unicode Standard and ISO

2018-06-07 Thread Mark Davis ☕️ via Unicode
A few facts.

> ... Consortium refused till now to synchronize UCA and ISO/IEC 14651.

ISO/IEC 14651 and Unicode have longstanding cooperation. Ken Whistler could
speak to the synchronization level in more detail, but the above statement
is inaccurate.

> ... For another part it [sync with ISO/IEC 15897] failed because the
> Consortium refused to cooperate, despite repeated proposals for a merger
> of both instances.

I recall no serious proposals for that.

(And in any event — very unlike the synchrony with 10646 and 14651 — ISO 15897
brought no value to the table. Certainly nothing to outweigh the
considerable costs of maintaining synchrony. Completely inadequate
structure for modern system requirement, no particular industry support,
and scant content: see Wikipedia for "The registry has not been updated
since December 2001".)

Mark

On Thu, Jun 7, 2018 at 1:25 PM, Marcel Schneider via Unicode <
unicode@unicode.org> wrote:

> On Thu, 17 May 2018 09:43:28 -0700, Asmus Freytag via Unicode wrote:
> >
> > On 5/17/2018 8:08 AM, Martinho Fernandes via Unicode wrote:
> > > Hello,
> > >
> > > There are several mentions of synchronization with related standards in
> > > unicode.org, e.g. in https://www.unicode.org/versions/index.html, and
> > > https://www.unicode.org/faq/unicode_iso.html. However, all such
> mentions
> > > never mention anything other than ISO 10646.
> >
> > Because that is the standard for which there is an explicit
> > understanding by all involved relating to synchronization. There have
> > been occasionally some challenging differences in the process and
> > procedures, but generally the synchronization is being maintained,
> > something that's helped by the fact that so many people are active in
> > both arenas.
>
> Perhaps the cause-effect relationship is somewhat unclear. I think that
> many people being
> active in both arenas is helped by the fact that there is a strong will to
> maintain synching.
>
> If there were similar policies notably for ISO/IEC 14651 (collation) and
> ISO/IEC 15897
> (locale data), ISO/IEC 10646 would be far from standing alone in the
> field of Unicode-ISO/IEC cooperation.
>
> >
> > There are really no other standards where the same is true to the same
> > extent.
> > >
> > > I was wondering which ISO standards other than ISO 10646 specify the
> > > same things as the Unicode Standard, and of those, which ones are
> > > actively kept in sync. This would be of importance for standardization
> > > of Unicode facilities in the C++ language (ISO 14882), as reference to
> > > ISO standards is generally preferred in ISO standards.
> > >
> > One of the areas the Unicode Standard differs from ISO 10646 is that
> > its conception of a character's identity implicitly contains that
> > character's properties - and those are standardized as well and
> > alongside of just name and serial number.
>
> This is probably why, to date, ISO/IEC 10646 features character
> properties by including normative references to the Unicode Standard,
> Standard Annexes, and the UCD.
> Bidi-mirroring e.g. is part of ISO/IEC 10646 that specifies in clause 15.1:
>
> “[…] The list of these characters is determined by having the
> ‘Bidi_Mirrored’ property
> set to ‘Y’ in the Unicode Standard. These values shall be determined
> according to
> the Unicode Standard Bidi Mirrored property (see Clause 2).”
>
> >
> > Many of these properties have associated with them algorithms, e.g. the
> > bidi algorithm, that are an essential element of data interchange: if
> > you don't know which order in the backing store is expected by the
> > recipient to produce a certain display order, you cannot correctly
> > prepare your data.
> >
> > There is one area where standardization in ISO relates to work in
> > Unicode that I can think of, and that is sorting.
>
> Yet UCA conforms to ISO/IEC 14651 (where UCA is cited as entry #28 in the
> bibliography).
> The reverse relationship is irrelevant and would be unfair, given that the
> Consortium
> refused till now to synchronize UCA and ISO/IEC 14651.
>
> Here is a need for action.
>
> > However, sorting, beyond the underlying framework,
> > ultimately relates to languages, and language-specific data is now
> > housed in CLDR.
> >
> > Early attempts by ISO to standardize a similar framework for locale
> > data failed, in part because the framework alone isn't the interesting
> > challenge for a repository, instead it is the collection, vetting and
> > management of the data.
>
> For another part it failed because the Consortium refused to cooperate,
> despite
> repeated proposals for a merger of both instances.
>
> >
> > The reality is that the ISO model and its organizational structures are
> > not well suited to the needs of many important areas where some form of
> > standardization is needed. That's why we have organizations like IETF,
> > W3C, Unicode etc..
> >
> > Duplicating all or even part of their effort inside ISO really serves
> > nobody's 

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Mark Davis ☕️ via Unicode
> The proposal also asks for identifiers to be treated as equivalent under
NFKC.

The guidance in #31 may not be clear. It is not to replace identifiers as
typed in by the user by their NFKC equivalent. It is rather to internally
*identify* two identifiers (as typed in by the user) as being the same. For
example, Pascal had case-insensitive identifiers. That means someone could
type in

myIdentifier = 3;
MyIdentifier = 4;

And both of those would be references to the same internal entity. So cases
like SARA AM don't necessarily play into this.
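A sketch of that "identify, don't replace" model in Python, using the standard unicodedata module (the helper name is mine):

```python
import unicodedata

def same_identifier(a: str, b: str) -> bool:
    # Identify two identifiers as equivalent under NFKC without
    # rewriting what the user actually typed.
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

# U+FB03 LATIN SMALL LIGATURE FFI vs. the plain letters:
print(same_identifier("e\ufb03cient", "efficient"))  # True
```

Both spellings refer to the same internal entity, yet each is displayed as the user typed it.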

> IMO the major issue with non-ASCII identifiers is not a technical one,
but rather that it runs the risk of fragmenting the developer community.

IMO, forcing everyone to stick to the limitations of ASCII for all
identifiers is unnecessary and often counterproductive.

First, programmers tend to think of "identifiers" as being specifically
"identifiers in programming languages" (and often "identifiers in
programming languages that I think are important"). Identifiers may occur in
much broader contexts, often being much closer to end users (eg spreadsheet
formulae) or scripting languages, user identifiers, and so on.

Secondly, even with programming languages that are restricted to ASCII,
people can choose identifiers in code like the following, which would not
be obvious to many people.

var Stellenwert = Verteidigungsministerium_Konto.verarbeite(); // Asmus
könnte realistischere Beispiele vorschlagen

For a given project, and for programming languages (as opposed to more
user-facing languages) the language to be used for variables, functions,
comments, etc. will often be English, to allow for broader participation.
But that should be a choice of the people involved. There are clearly many
cases where that restriction is not optimal for a given project, where not
all of the developers (and prospective developers) are fluent in English,
but do share another common language. Think of all the in-house development
in countries and organizations around the world.

And finally, it's not like you hear of huge problems from Java or Swift or
other programming languages because they support non-ASCII identifiers.
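Python is another such language; per PEP 3131 it follows UAX #31 and NFKC-normalizes identifiers at parse time. A small sketch:

```python
ns = {}
# Identifier typed with U+FB01 LATIN SMALL LIGATURE FI ("fi" as one character):
exec("\ufb01le_count = 1", ns)
# PEP 3131: the parser NFKC-normalizes identifiers, so it lands as "file_count".
print("file_count" in ns)  # True
```

The ligature spelling and the plain-letters spelling name the same variable, exactly the internal identification described above.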


Mark

On Thu, Jun 7, 2018 at 9:36 AM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Tue, 5 Jun 2018 01:37:47 +0100
> Richard Wordingham via Unicode  wrote:
>
> > The decomposed
> > form that looks the same is นํ้า .
> > The problem is that for sane results,  needs
> > special handling. This sequence is also often untypable - part of the
> > protection against Thai homographs.
>
> I've been misquoted on the Rust discussion topic - or the behaviour is
> more diverse than I was aware of.  On LibreOffice, with sequence
> checking not disabled, typing  blocks the subsequent input
> of U+0E49 or U+0E32 immediately afterwards.  Another mechanism
> is for typing another vowel to replace the U+0E4D.  The problem here is
> that in standard Thai, U+0E4D may not be followed by another vowel or
> tone mark, so Wing Thuk Thi (WTT) rules cut in.  (They're also quite
> good at preventing one from typing Northern Khmer.)  In LibreOffice,
> typing the NFKC form  is stopped at
> attempting to type U+0E4D, though one can get back to the original by
> typing U+0E33 instead.  To the rule checker, that is mission
> accomplished!
>
> Richard.
>
>
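A concrete sketch (Python's unicodedata) of why these Thai sequences need special handling: NFKC decomposes SARA AM into NIKHAHIT + SARA AA but does not move the tone mark, so two visually identical strings remain distinct after normalization.

```python
import unicodedata

typed     = "\u0e19\u0e49\u0e33"        # NO NU + MAI THO + SARA AM (usual typing)
lookalike = "\u0e19\u0e4d\u0e49\u0e32"  # NO NU + NIKHAHIT + MAI THO + SARA AA

# SARA AM compatibility-decomposes to NIKHAHIT + SARA AA, but MAI THO
# (ccc 107) is not reordered past NIKHAHIT (ccc 0), so the homographs
# stay distinct even under NFKC:
print(unicodedata.normalize("NFKC", typed) == lookalike)  # False
```

Both strings render identically, yet no normalization form equates them, which is the homograph problem being discussed.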


Re: Submissions open for 2020 Emoji

2018-04-20 Thread Mark Davis ☕️ via Unicode
BTW, Slide 23 on http://unicode.org/emoji/slides.html ("Unicode Resources:
Specs, Data, and Code") shows one view of the relative sizes of Unicode
Consortium projects, divided up by cldr, icu, encoding (eg UTC output), and
also breaks out emoji.

(It does need a bit of updating, since we have added emoji names to cldr.)

Mark

On Thu, Apr 19, 2018 at 2:32 PM, Mark Davis ☕️ <m...@macchiato.com> wrote:

> > imagine I discover that someone has already proposed the emoji that I
> am interested in
>
> In some cases we have contacted people to see if they want to engage
> with other proposers. But to handle larger numbers we'd need a simple,
> light-weight way to let people know, while maintaining people's privacy
> when they want it.
>
> > Also, there seems to be no systematic reason...
>
> The ESC periodically prioritizes some of the larger sets and forwards a
> list to the UTC.
>
> >If an emoji proposal is well-formed and fits the general scope it should
> be forwarded to UTC.
>
> Emoji are a relatively small part of the work of the consortium, and
> should remain that way. So the UTC depends on the ESC to evaluate the
> quality and priority of proposals, based on the factors described.
>
> > Others are outdated, for instance because the larger set they have been
> added to has already been processed by UTC and they were declined. Some
> categories have only a single entry, others are clearly aliases of each
> other or subcategories.
> > I would like to help clean up the data, e.g. by commenting on the Google
> Spreadsheet that is embedded on the Unicode page. How can I do that as an
> individual member?
>
> That would be helpful, thanks. What I would suggest is taking a copy of
> the sheet, dumping into a spreadsheet (Google or Excel) and adding a column
> for your suggestions. You can then submit that. Note that the numbers are
> just to provide a count, there is no binding connection between them and
> the rest of the line.
>
> Mark
>
> On Thu, Apr 19, 2018 at 12:51 PM, Christoph Päper via Unicode <
> unicode@unicode.org> wrote:
>
>> announceme...@unicode.org:
>> >
>> > The emoji subcommittee has also produced a new page which shows the
>> > Emoji Requests <http://www.unicode.org/emoji/emoji-requests.html>
>> > submitted so far. You can look at what other people have proposed or
>> > suggested. In many cases, people have made suggestions, but have not
>> > followed through with complete submission forms, or have submitted
>> > forms, but not followed through on requested modifications to the forms.
>>
>> This is good news! However, imagine I discover that someone has already
>> proposed the emoji that I am interested in, but their formal proposal needs
>> some work: From the public data I can not see when this proposal has been
>> received or whether it has been updated. Since I also cannot contact the
>> author, either I have to hope they are still working on the proposal or I
>> have to submit a separate proposal of my own, duplicating all the work.
>>
>> Also, there seems to be no systematic reason for which proposals get
>> shelved as "Added to larger set" while related ones (e.g. random animals)
>> progress to the UTC. The ESC should not have this power of gatekeeping. If
>> an emoji proposal is well-formed and fits the general scope it should be
>> forwarded to UTC, hence be published in the L2 repository. Alternatively,
>> the ESC should collect *all* proposals that semantically belong to a larger
>> set (e.g. animals) in a composite document and forward this annually, for
>> instance.
>>
>> Some entries are also opaque or ambiguous, i.e. not helpful, e.g.:
>>
>> 705  Six Chinese Styles       Added to larger set  Mixed
>> 706  Six Chinese-style Emoji  No proposal form     Other
>>
>> Others are outdated, for instance because the larger set they have been
>> added to has already been processed by UTC and they were declined. Some
>> categories have only a single entry, others are clearly aliases of each
>> other or subcategories. I would like to help clean up the data, e.g. by
>> commenting on the Google Spreadsheet that is embedded on the Unicode page.
>> How can I do that as an individual member?
>>
>
>


Re: Submissions open for 2020 Emoji

2018-04-20 Thread Mark Davis ☕️ via Unicode
If you want, you can make a proposal to the effect that all proposals made
to the Unicode Consortium be hosted publicly in a place accessible from the
Unicode site. Then the UTC can consider your proposal.

I think it would help the discussion to provide in your proposal links to
policy statements from the W3C, ICANN, etc. that follow that policy. (I'm
not sure exactly what you encompass in your term "public standard": for
example, would you include ISO in that list, even though people have to pay
for (most of) theirs?)

Mark


On Thu, Apr 19, 2018 at 8:50 PM, Asmus Freytag (c) <asm...@ix.netcom.com>
wrote:

> On 4/19/2018 9:36 AM, Mark Davis ☕️ wrote:
>
> The UTC didn't want to burden the doc registry with all the emoji
> proposals.
>
>
> The question of whether the registry should be divided is independent of
> whether proposals are public or private in nature.
>
> Proposals in private have no place in the context of public standard.
>
> A./
>
>
> Mark
>
> On Thu, Apr 19, 2018 at 6:22 PM, Asmus Freytag via Unicode <
> unicode@unicode.org> wrote:
>
>> On 4/19/2018 5:32 AM, Mark Davis ☕️ via Unicode wrote:
>>
>> > imagine I discover that someone has already proposed the emoji that I
>> am interested in
>>
>> In some cases we have contacted people to see if they want to engage
>> with other proposers. But to handle larger numbers we'd need a simple,
>> light-weight way to let people know, while maintaining people's privacy
>> when they want it.
>>
>>
>> I would tend to think that actual proposals are a matter of public
>> record. Emoji should not be handled differently than other proposals for
>> character encoding in that regard.
>>
>> Why should there be an assumption that these are "proposals in private"
>> in this case?
>>
>> A./
>>
>
>
>


Re: Submissions open for 2020 Emoji

2018-04-19 Thread Mark Davis ☕️ via Unicode
The UTC didn't want to burden the doc registry with all the emoji proposals.

Mark

On Thu, Apr 19, 2018 at 6:22 PM, Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 4/19/2018 5:32 AM, Mark Davis ☕️ via Unicode wrote:
>
> > imagine I discover that someone has already proposed the emoji that I
> am interested in
>
> In some cases we have contacted people to see if they want to engage
> with other proposers. But to handle larger numbers we'd need a simple,
> light-weight way to let people know, while maintaining people's privacy
> when they want it.
>
>
> I would tend to think that actual proposals are a matter of public record.
> Emoji should not be handled differently than other proposals for character
> encoding in that regard.
>
> Why should there be an assumption that these are "proposals in private" in
> this case?
>
> A./
>


Unicode Utilities

2018-03-23 Thread Mark Davis ☕️ via Unicode
For testing, the Unicode Utilities now support the Unicode beta properties
(with some caveats). Example: \p{gcβ=Lu}-\p{gc=Lu}


Thanks to Sascha for helping to move to different infrastructure for the
utilities...
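The same kind of set difference can be sketched locally. Python ships a UCD 3.2 snapshot (unicodedata.ucd_3_2_0), so one can diff General_Category assignments between two UCD versions, analogous to the \p{gcβ=Lu}-\p{gc=Lu} query (with the older version standing in for the beta here):

```python
import unicodedata
from unicodedata import ucd_3_2_0  # UCD 3.2 snapshot bundled with Python

# Code points that are Lu in the current UCD but were not Lu in UCD 3.2
# (mostly uppercase letters encoded after 3.2).
new_lu = [
    cp for cp in range(0x110000)
    if unicodedata.category(chr(cp)) == "Lu"
    and ucd_3_2_0.category(chr(cp)) != "Lu"
]
print(len(new_lu) > 0)  # True
```

The Unicode Utilities run the equivalent query against the release and beta property tables server-side.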

Mark


Re: Full Emoji List Chart No Longer Displaying Emoji with Skin-tones

2018-03-17 Thread Mark Davis ☕️ via Unicode
You can take a look at emojipedia. They have a good set of information
about emoji glyphs.


On Sat, Mar 17, 2018, 17:44 Ed Borgquist <ed.borgqu...@website.ws> wrote:

> Thanks for the information. Does Unicode make public the source images
> received from vendors? Or, is there somewhere else you would recommend for
> me to look?
>
>
>
> Kindest Regards,
>
>
>
> Ed Borgquist
>
> .WS Registry
>
>
>
> *From:* mark.edward.da...@gmail.com [mailto:mark.edward.da...@gmail.com] *On
> Behalf Of *Mark Davis ☕️
> *Sent:* Saturday, March 17, 2018 5:20 AM
> *To:* Ed Borgquist
> *Cc:* Unicode Public
> *Subject:* Re: Full Emoji List Chart No Longer Displaying Emoji with
> Skin-tones
>
>
>
> We were getting so much traffic on the emoji pages that we had to produce
> an abbreviated version to reduce the load (without skin tones, it is about
> half the size).
>
>
>
> We are looking at improvements to the infrastructure and/or chart design
> that would let us restore them, but people are busy with other Unicode
> projects right now.
>
>
> Mark
>
>
>
> On Sat, Mar 17, 2018 at 1:56 AM, Ed Borgquist via Unicode <
> unicode@unicode.org> wrote:
>
> Hello All,
>
> The Full Emoji List [1] had, in the past, displayed Emoji with all skin
> tone variants. It seems that this is no longer the case. Does anyone know
> if it is possible that this could return in the future?
>
> This data was useful for myself, as scraping this data allowed for me to
> identify "homographic" Emoji from a variety of vendors. Additionally, I
> could see how vendors approached skin tone variants for
> difficult-to-distinguish Emoji (for example, SNOWBOARDER often features a
> person with no visible skin).
>
> [1] https://unicode.org/emoji/charts/full-emoji-list.html
>
> Kindest Regards,
>
> Ed Borgquist
> .WS Registry
>
>
>


Re: Full Emoji List Chart No Longer Displaying Emoji with Skin-tones

2018-03-17 Thread Mark Davis ☕️ via Unicode
We were getting so much traffic on the emoji pages that we had to produce
an abbreviated version to reduce the load (without skin tones, it is about
half the size).

We are looking at improvements to the infrastructure and/or chart design
that would let us restore them, but people are busy with other Unicode
projects right now.

Mark

On Sat, Mar 17, 2018 at 1:56 AM, Ed Borgquist via Unicode <
unicode@unicode.org> wrote:

> Hello All,
>
> The Full Emoji List [1] had, in the past, displayed Emoji with all skin
> tone variants. It seems that this is no longer the case. Does anyone know
> if it is possible that this could return in the future?
>
> This data was useful for myself, as scraping this data allowed for me to
> identify "homographic" Emoji from a variety of vendors. Additionally, I
> could see how vendors approached skin tone variants for
> difficult-to-distinguish Emoji (for example, SNOWBOARDER often features a
> person with no visible skin).
>
> [1] https://unicode.org/emoji/charts/full-emoji-list.html
>
> Kindest Regards,
>
> Ed Borgquist
> .WS Registry
>
>


Re: A sketch with the best-known Swiss tongue twister

2018-03-09 Thread Mark Davis ☕️ via Unicode
> In summary you do not object to the fact that unqualified "gsw" language code

Whether I object or not makes no difference.

Whether for good or for bad, the gsw code (clearly originally for
German-Swiss from the code letters) has been expanded beyond the borders of
Switzerland. There are also separate codes for Schwäbisch and
Waliserdütsch, so outside of Switzerland 'gsw' mainly extends to Elsassisch
(Alsace, ~0.5M speakers). So gsw-CH works to limit the scope to Switzerland
(~4.5M speakers).

> My opinion is that even the Swiss variants should be preferably named
"Swiss Alemannic" collectively...

That's clearly also not going to happen for the English term. Good luck
with the French equivalent...

Mark

On Fri, Mar 9, 2018 at 3:52 PM, Philippe Verdy <verd...@wanadoo.fr> wrote:

> In summary you do not object to the fact that unqualified "gsw" language code
> is not (and should not be) named "Swiss German" (as it is only for
> "gsw-CH", not for any other non-Swiss variants of Alemannic).
>
> The addition of "High" is optional, unneeded in fact, as it does not
> remove any ambiguity, in Germany for "de-DE", or in Switzerland for
> "de-CH", or in Italian South Tyrol for "de-IT", or in Austria for "de-AT",
> or even for "Standard German" (de)
>
> Note also that Alsatian itself ("gsw-FR") is considered part of the "High
> German" branch of Germanic languages !
>
> "High German" refers to the group that includes Standard German and its
> national variants ("de", "de-DE", "de-CH", "de-AT", "de-CH", "de-IT") as
> well as the Alemannic group ( "gsw" , "gsw-FR", "gsw-CH"), possibly extended
> (this is debatable) to Schwäbisch in Germany and Hungary.
>
> My opinion is that even the Swiss variants should be preferably named
> "Swiss Alemannic" collectively, and not "Swiss German" which causes
> constant confusion between "de-CH" and "gsw-CH".
>
>
> 2018-03-09 15:11 GMT+01:00 Mark Davis ☕️ via Unicode <unicode@unicode.org>
> :
>
>> Yes, the right English names are "Swiss High German" for de-CH, and
>> "Swiss German" for gsw-CH.
>>
>> Mark
>>
>> On Fri, Mar 9, 2018 at 2:40 PM, Tom Gewecke via Unicode <
>> unicode@unicode.org> wrote:
>>
>>>
>>> > On Mar 9, 2018, at 5:52 AM, Philippe Verdy via Unicode <
>>> unicode@unicode.org> wrote:
>>> >
>>> > So the "best-known Swiss tongue" is still not so much known, and still
>>> incorrectly referenced (frequently confused with "Swiss German", which is
>>> much like standard High German
>>>
>>> I think Swiss German is in fact the correct English name for the Swiss
>>> dialects, taken from the German Schweizerdeutsch.
>>>
>>> https://en.wikipedia.org/wiki/Swiss_German
>>>
>>
>>
>


Re: A sketch with the best-known Swiss tongue twister

2018-03-09 Thread Mark Davis ☕️ via Unicode
Yes, the right English names are "Swiss High German" for de-CH, and "Swiss
German" for gsw-CH.

Mark

On Fri, Mar 9, 2018 at 2:40 PM, Tom Gewecke via Unicode  wrote:

>
> > On Mar 9, 2018, at 5:52 AM, Philippe Verdy via Unicode <
> unicode@unicode.org> wrote:
> >
> > So the "best-known Swiss tongue" is still not so much known, and still
> incorrectly referenced (frequently confused with "Swiss German", which is
> much like standard High German
>
> I think Swiss German is in fact the correct English name for the Swiss
> dialects, taken from the German Schweizerdeutsch.
>
> https://en.wikipedia.org/wiki/Swiss_German
>


Re: A sketch with the best-known Swiss tongue twister

2018-03-09 Thread Mark Davis ☕️ via Unicode
There are definitely many dialects across Switzerland. I think that for
*this* phrase it would be roughly the same for most of the population, with
minor differences (eg 'het' vs 'hät'). But a native speaker like Martin
would be able to say for sure.

Mark

On Fri, Mar 9, 2018 at 12:52 PM, Philippe Verdy <verd...@wanadoo.fr> wrote:

> Is that just for Switzerland in one of the local dialectal variants ? Or
> more generally Alemannic (also in Northeastern France, South Germany,
> Western Austria, Liechtenstein, Northern Italy).
>
> 2018-03-09 12:09 GMT+01:00 Mark Davis ☕️ via Unicode <unicode@unicode.org>
> :
>
>> https://www.youtube.com/watch?v=QOwITNazUKg
>>
>> De Papscht hät z’Schpiäz s’Schpäkchbschtekch z’schpaat bschtellt.
>> literally: The Pope has [in Spiez] [the bacon cutlery] [too late] ordered.
>>
>> Mark
>>
>
>


A sketch with the best-known Swiss tongue twister

2018-03-09 Thread Mark Davis ☕️ via Unicode
https://www.youtube.com/watch?v=QOwITNazUKg

De Papscht hät z’Schpiäz s’Schpäkchbschtekch z’schpaat bschtellt.
literally: The Pope has [in Spiez] [the bacon cutlery] [too late] ordered.

Mark


Re: Sentence_Break, Semi-colons, and Apparent Miscategorization

2018-03-08 Thread Mark Davis ☕️ via Unicode
From the first line, I guess you mean that all three questions have to do
with the Sentence_Break property values. Namely:

http://www.unicode.org/reports/tr29/proposed.html#Table_Sentence_Break_Property_Values
http://www.unicode.org/reports/tr29/proposed.html#SContinue

Mark

On Thu, Mar 8, 2018 at 9:25 AM, fantasai via Unicode 
wrote:

> Given that the comma and colon are categorized as SContinue,
> why is the semicolon also not SContinue?


> Also, why is the Greek Question Mark not categorized with
> the rest of the question marks?
>

As I recall, both are because the semicolon can also represent a Greek
question mark (they are canonically equivalent, so you can't reliably
distinguish between them).
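That canonical equivalence is easy to verify with Python's unicodedata:

```python
import unicodedata

# U+037E GREEK QUESTION MARK has a singleton canonical decomposition to
# U+003B SEMICOLON, so normalization folds the two together and property
# assignments cannot reliably tell them apart.
print(unicodedata.normalize("NFC", "\u037e") == ";")  # True
```

Singleton decompositions are excluded from recomposition, so even NFC maps U+037E to the plain semicolon.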

BTW, here is a table of property differences for codepoint X, toNfc(X) (if
a single character) and toNfkc(X) (again, if a single character).

https://docs.google.com/spreadsheets/d/1ZExxhAujA8kX42F8KBK3okX_So7Dt5YZvyanL8dH8tM/edit#gid=0

It was a quick dump so no guarantees that all the dots are crossed. It
skips comparing properties that are purposefully different across NFC (like
Decomposition_Mapping) or different code points (like Name or Block), and
most CJK properties (ones starting with 'k').
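A rough re-creation of that kind of dump, restricted to General_Category (the function name and scope are mine; the linked sheet covers many more properties):

```python
import unicodedata

def gc_vs_nfkc_diffs():
    """Code points whose General_Category differs from that of their
    NFKC form, when that form is a single character."""
    diffs = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
            continue
        ch = chr(cp)
        nk = unicodedata.normalize("NFKC", ch)
        if len(nk) == 1 and nk != ch:
            gc, gck = unicodedata.category(ch), unicodedata.category(nk)
            if gc != gck:
                diffs.append((f"U+{cp:04X}", gc, gck))
    return diffs

print(len(gc_vs_nfkc_diffs()) > 0)  # e.g. U+00B9 SUPERSCRIPT ONE: No -> Nd
```

U+00B9 is one such case: the superscript is gc=No, while its NFKC form "1" is gc=Nd.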


> Why aren't the vertical presentation forms categorized with
> the things they are presenting?
>

At least some of them are:
U+FE10 ( ︐ ) PRESENTATION FORM FOR VERTICAL COMMA
U+FE11 ( ︑ ) PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON
U+FE31 ( ︱ ) PRESENTATION FORM FOR VERTICAL EM DASH
U+FE32 ( ︲ ) PRESENTATION FORM FOR VERTICAL EN DASH

>
> Thanks~
> ~fantasai
>


Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-03-02 Thread Mark Davis ☕️ via Unicode
No, the patterns should always have the right format. However, in the
supplemental data there is information as to the preferred data for each
language. This data isn't collected through the Survey Tool, so a ticket needs to be
filed.

In your particular case, the data has:



If DE just doesn't use hB, then you can file a ticket to say that it
shouldn't be in @allowed.

Note that the format permits either regions or locales, as in:




As to involvement, we try to encourage interaction on the forum. In some
languages those are quite active; in others not so much. (BTW, a number of
your suggestions made sense to me, but not being a native German speaker, I
don't weigh in on de.xml except for structural issues or where people seem
to miss the intent.) So people may look at the forum, disagree with the
proposal, but not respond why they disagree.



Mark

On Fri, Mar 2, 2018 at 3:22 PM, Christoph Päper via Unicode <
unicode@unicode.org> wrote:

> F'up2: cldr-us...@unicode.org
>
> Doug Ewell via unicode@unicode.org:
> >
> > I think that is a measurement of locale coverage -- whether the
> > collation tables and translations of "a.m." and "p.m." and "a week ago
> > Thursday" are correct and verified -- not character coverage.
>
> By the way, the binary `am` vs. `pm` distinction common in English and
> labelled `a` as a placeholder in CLDR formats is too simplistic for some
> languages when using the 12-hour clock (which they usually don't use in written
> language). In German, for instance, you would always use a format with `B`
> instead (i.e. "morgens", "mittags", "abends", "nachts" or no identifier
> during daylight).
>
> How and where can I best suggest to change this in CLDR? The B formats
> have their own code, e.g. `Bhms` = `h:mm:ss B`. Should I just propose to
> set `hms` etc. to the same value next time the Survey Tool is open?
>
> In my experience, there are too few people reviewing even the "largest"
> languages (like German). I participated in v32 and v33, but other than me
> there were only contributions from (seemingly) a single employee from each
> of Apple, Google and Microsoft. Most improvements or corrections I
> suggested just got lost, i.e. nobody discussed or voted on them, so the old
> values remained.
>


Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-03-02 Thread Mark Davis ☕️ via Unicode
Right, Doug. I'll say a few more words.

In terms of language support, encoding of new characters in Unicode
benefits mostly digital heritage languages (via representation of historic
languages in Unicode, enabling preservation and scholarly work), although
there are some modern-use cases like Hanifi Rohingya. We do include digital
heritage under the umbrella of "digitally disadvantaged languages", though
our terminology is sometimes inconsistent.

But encoding is just a first step. A vital first step, but just one step.

People tend to forget that adding new characters is just a part of what
Unicode does. For script support, it is just as important to have correct
Unicode algorithms and properties, such as correct values for the
Indic_Positional_Category
property (which together with the related work in with the Universal
Shaping Engine, allows for proper rendering of many languages). Behind the
scenes we have people like Ken and Laurentiu who have to dig through the
encoding proposals and fill in the many, many gaps to come up with
reasonable properties for such basic behavior as line-break.

As important as the work is on encoding, properties, and algorithms, when
we go up a level we get CLDR and ICU. Those have more impact on language
support for far more people in the world than the addition of new scripts
does. After all, approaching half of the population of the globe owns
smartphones: ICU provides programmatic access to the Unicode encoding,
properties, and algorithms, and CLDR + ICU together provide the core
language support on essentially every one of those smartphones.

But in terms of language coverage, the chart you reference (and the
corresponding graph) show
how very far CLDR still has to go. So we are gearing up for ways to extend
that graph: to move at least the basic coverage (the lower plateau in that
graph) to more languages, and to move basic-coverage languages up to more
in-depth coverage. We are focusing on ways to improve the CLDR survey tool
backend and frontend, since we know it currently cannot handle the
number of people that want to contribute, and has glitches in the UI that
make it clumsier to use than it should be.

Well, this turned out to be more than just a few words... sorry for going
on!

Mark

On Thu, Mar 1, 2018 at 9:10 PM, Doug Ewell via Unicode 
wrote:

> Tim Partridge wrote:
>
> > Perhaps the CLDR work the Consortium does is being referenced. That is
> > by language on this list
> > http://www.unicode.org/cldr/charts/32/supplemental/locale_
> coverage.html#ee
> > By the time it gets to the 100th entry the Modern percentage has "room
> > for improvement".
>
> I think that is a measurement of locale coverage -- whether the
> collation tables and translations of "a.m." and "p.m." and "a week ago
> Thursday" are correct and verified -- not character coverage.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
>


Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-02-28 Thread Mark Davis ☕️ via Unicode
I'm more interested in what areas you found unclear, because wherever you
did I'm sure many others would as well. You can reply off-list if you want.

Mark


On Wed, Feb 28, 2018 at 12:22 PM, Janusz S. Bień 
wrote:

>
> Thanks to all who answered. The answers are very clear, but the original
> message and the adoption page are in my opinion much less clear. I can
> however live with it :-)
>
> Best regards
>
> Janusz
>
> On Wed, Feb 28 2018 at 11:53 +0100, m...@macchiato.com writes:
> > Also, please click through from the announcement to
> http://www.unicode.org/consortium/adopt-a-character.html.
> >
> > If it isn't apparent from that page what the relationship is, we have
> some work to do...
> >
> > Mark
>
> > On Wed, Feb 28, 2018 at 11:48 AM, Martin J. Dürst via Unicode <
> unicode@unicode.org> wrote:
> >
> >  On 2018/02/28 19:38, Janusz S. Bień via Unicode wrote:
> >
> >  On Tue, Feb 27 2018 at 13:45 -0800, announceme...@unicode.org writes:
> >
> >  The 157 new Emoji are now available for adoption, to help the Unicode
> >  Consortium’s work on digitally disadvantaged languages.
> >
> >  I'm quite curious what is the relation between the new emojis and the
> >  digitally disadvantaged languages. I see none.
> >
> >  I think this was mentioned before on this list, in particular by Mark:
> >  The money collected from character adoptions (where emoji are a
> prominent target) is (mostly?) used to support work on not-yet-encoded
> (thus digitally
> >  disadvantaged) scripts. See e.g. the recent announcement at
> http://blog.unicode.org/2018/02/adopt-character-grant-to-
> support-three.html.
>
>
>
> --
>,
> Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra
> Lingwistyki Formalnej)
> Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
> jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~
> jsbien/
>


Re: Unicode Emoji 11.0 characters now ready for adoption!

2018-02-28 Thread Mark Davis ☕️ via Unicode
Also, please click through from the announcement to
http://www.unicode.org/consortium/adopt-a-character.html.

If it isn't apparent from that page what the relationship is, we have some
work to do...

Mark

On Wed, Feb 28, 2018 at 11:48 AM, Martin J. Dürst via Unicode <
unicode@unicode.org> wrote:

> On 2018/02/28 19:38, Janusz S. Bień via Unicode wrote:
>
>> On Tue, Feb 27 2018 at 13:45 -0800, announceme...@unicode.org writes:
>>
>> The 157 new Emoji are now available for adoption, to help the Unicode
>>> Consortium’s work on digitally disadvantaged languages.
>>>
>>
>> I'm quite curious what is the relation between the new emojis and the
>> digitally disadvantaged languages. I see none.
>>
>
> I think this was mentioned before on this list, in particular by Mark:
> The money collected from character adoptions (where emoji are a prominent
> target) is (mostly?) used to support work on not-yet-encoded (thus
> digitally disadvantaged) scripts. See e.g. the recent announcement at
> http://blog.unicode.org/2018/02/adopt-character-grant-to-sup
> port-three.html.
>
> Regards,   Martin.
>


Re: Why so much emoji nonsense?

2018-02-16 Thread Mark Davis ☕️ via Unicode
A few points

1. To add to what Asmus said, see also
http://unicode.org/L2/L2018/18044-encoding-emoji.pdf

"Their encoding, surprisingly, has been a boon for language support. The
emoji draw on Unicode
mechanisms that are used by various languages, but which had been
incompletely implemented on
many platforms. Because of the demand for emoji, many implementations have
upgraded their
Unicode support substantially. That means that implementations now have far
better support for the
languages that use the more complicated Unicode mechanisms."

An example of that is MySQL, where the rise of emoji led to non-BMP support.


2. Aside from SEI (at UCB), we've also been able to fund a number of
projects such as
http://blog.unicode.org/2016/12/adopt-character-grant-to-support-indic.html


4. Finally, I'd like to point out that this external mailing list is open
to anyone (subject to civil behavior), with the main goal being to provide
a forum for people to ask questions about how to deploy, use, and
contribute to Unicode, and get answers from a community of users.

Those who want to engage in extended kvetching can take that to the
rightful place: *Twitter*.

Mark







On Fri, Feb 16, 2018 at 9:25 AM, Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 2/15/2018 11:54 PM, James Kass via Unicode wrote:
>
> Pierpaolo Bernardi wrote:
>
>
> But it's always a good time to argue against the addition of more
> nonsense to what we already have got.
>
> It's an open-ended set and precedent for encoding them exists.
> Generally, input regarding the addition of characters to a repertoire
> is solicited from the user community, of which I am not a member.
>
> My personal feeling is that all of the time, effort, and money spent
> by the various corporations in promoting the emoji into Unicode would
> have been better directed towards something more worthwhile, such as
> the unencoded scripts listed at:
>
>  http://www.linguistics.berkeley.edu/sei/scripts-not-encoded.html
>
> ... but nobody asked me.
>
>
> Curiously enough it is the emoji that keep a large number of users (and
> companies
> serving them) engaged with Unicode who would otherwise be likely to come
> to the conclusion that Unicode is "done" as far as their needs are
> concerned.
>
> Few, if any, of the not-yet-encoded scripts are used by large living
> populations,
> therefore they are not urgently missing / needed in daily life and are of
> interest
> primarily to specialists.
>
> Emoji are definitely up-ending that dynamic, which I would argue is a good
> thing.
>
> A financially well endowed Consortium with strong membership is a
> prerequisite
> to fulfilling the larger cultural mission of Unicode. Sure, for the
> populations
> whose scripts are already encoded, there are separate issues that will keep
> some interest alive, like solving problems related to algorithms and
> locales, but
> also dealing with extensions of existing scripts and notational systems -
> although
> few enough of those are truly urgent/widely used.
>
> The University of Berkeley people would be the first to tell you how their
> funding
> picture is positively influenced by the current perceived relevancy of
> the Unicode
> Consortium - much of it being due to those emoji.
>
> A./
>
>
>
>


Re: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?)

2018-01-28 Thread Mark Davis ☕️ via Unicode
On Sun, Jan 28, 2018 at 3:20 PM, Doug Ewell <d...@ewellic.org> wrote:

> Mark Davis wrote:
>
> One addition: with the expansion of keyboards in
>> http://blog.unicode.org/2018/01/unicode-ldml-keyboard-enhancements.html
>> we are looking to expand the repository to not merely represent those,
>> but to also serve as a resource that vendors can draw on.
>>
>
> Would you say, then, that Marcel's statements:
>
> "Now that CLDR is sorting out how to improve keyboard layouts, hopefully
> something falls off to replace the *legacy* US-Intl."
>

> and:
>
> "We can only hope that now, CLDR is thoroughly re-engineering the way
> international or otherwise extended keyboards are mapped."
>
> reflect the situation accurately?
>
> Nothing in the PRI #367 blog post or background document communicated to
> me that CLDR was going to try to influence vendors to retire these keyboard
> layouts and replace them with those. I thought it was just about providing
> a richer CLDR format and syntax to better "support keyboard layouts from
> all major providers." Please point me to the part I missed.


Your message didn't quote
​the part about «replace the *legacy* US-Intl."»

The PRI blog post is talking about the technical changes, not process. The
goal there is to be able to represent keyboard structures and data in a
"lingua franca", and to expand the features needed to cover more languages
and more vendor requirements. Of course, more extensions will be needed in
the future, as well.

As far as process goes, we foresee (a) continuing to reflect what is being
used in practice, and (b) extending to a repository for keyboards for
languages that are not represented by current vendors. That is to enable
vendors to easily add keyboards for support of additional languages, if
they want.

It is not a goal to get "vendors to retire these keyboard layouts and
replace them" — that's not our role. (And I'm sure that a lot of people
like and would continue to use the Windows Intl keyboard.)

It's more about making it easier to have more choice available for users:
more languages, and more choice of layouts within a language that meet
people's needs.


>
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Keyboard layouts and CLDR (was: Re: 0027, 02BC, 2019, or a new character?)

2018-01-28 Thread Mark Davis ☕️ via Unicode
One addition: with the expansion of keyboards in
http://blog.unicode.org/2018/01/unicode-ldml-keyboard-enhancements.html we
are looking to expand the repository to not merely represent those, but to
also serve as a resource that vendors can draw on.

Mark

On Sun, Jan 28, 2018 at 1:11 PM, Doug Ewell via Unicode  wrote:

> Marcel Schneider wrote:
>
> We can only hope that now, CLDR is thoroughly re-engineering the way
>> international or otherwise extended keyboards are mapped.
>>
>
> I suspect you already know this and just misspoke, but CLDR doesn't
> prescribe any vendor's keyboard layouts. CLDR mappings reflect what vendors
> have released.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: [HUMOR] Proof that emojis are useful

2018-01-27 Thread Mark Davis ☕️ via Unicode
Nice, thanks!

Mark

On Sat, Jan 27, 2018 at 7:31 AM, Stephane Bortzmeyer via Unicode <
unicode@unicode.org> wrote:

> Nice scientific info, and with emojis :
>
> https://twitter.com/biolojical/status/956953421130514432
>


Re: 0027, 02BC, 2019, or a new character?

2018-01-25 Thread Mark Davis ☕️ via Unicode
My apologies for the typo. There's no excuse for misspelling someone's name
(especially since I live in Switzerland, and type German every day).

Thanks for calling my attention to it: the doc has been updated.

Mark


On Thu, Jan 25, 2018 at 4:15 AM, Andrew West via Unicode <
unicode@unicode.org> wrote:

> On 23 January 2018 at 00:55, James Kass via Unicode 
> wrote:
> >
> > Regular American users simply don't type umlauts, period.
>
> Not even the president of the Unicode Consortium when referring to
> Christoph Päper:
>
> http://www.unicode.org/L2/L2018/18051-emoji-ad-hoc-resp.pdf
>
> Andrew
>
>


Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2018-01-22 Thread Mark Davis ☕️ via Unicode
Good point, thanks

Mark

On Mon, Jan 22, 2018 at 6:41 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sun, 21 Jan 2018 22:34:12 -0800
> Mark Davis ☕️ via Unicode <unicode@unicode.org> wrote:
>
> > The ZWJ Virama sequence is already provided for by the combination of
> > GB9 & GB9c. But not the ZWNJ. If we want to handle that, it would
> > mean the addition of something like:
> >
> > GB9d: × (ZWNJ ViramaExtend* Virama)
>
> I don't think we need ViramaExtend* here.  The sequence should be
> followed by a base consonant, so there's no way for another mark to
> sneak in.
>
> Incidentally, I think ViramaExtend would be better named as NSExtend,
> with 'NS' for 'non-starter'.
>
> Richard.
>
>


Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2018-01-21 Thread Mark Davis ☕️ via Unicode
I was looking at the feedback in http://www.unicode.org/review/pri355/, and
didn't see yours there. Could you please file your feedback there? (Nothing
on this list is tracked by the committee...)


FYI, I'm thinking now that the change should be:

GB9c: (Virama | ZWJ )   × LinkingConsonant
=>
GB9c: (Virama ViramaExtend* | ZWJ ) × LinkingConsonant

where ViramaExtend = [Extend - Virama - \p{ccc=0}]
(This is pre-partitioning.)

That is close to your formulation, but for canonical equivalence there
should be no need to allow ViramaExtend after the ZWJ, because the ZWJ has
ccc=0, and thus nothing reorders around it.

Cibu also pointed out on a different thread that for Malayalam we need to
consider a couple of other forms:

... Following contexts should be allowed for requesting reformed or
traditional conjuncts as per Unicode10.0.0/ch12 page 505.  ...

/$L ZWNJ $V $L/
/$L ZWJ $V $L/

The ZWJ Virama sequence is already provided for by the combination of GB9
& GB9c. But not the ZWNJ. If we want to handle that, it would mean the
addition of something like:

GB9d: × (ZWNJ ViramaExtend* Virama)

Cibu also wrote:


Also, when we disallow /$L $V ZWJ $D/, it is disallowing the sequences
involving legacy chillus. That is, for example,  is
a valid sequence (examples in Unicode10.0.0/ch12 Table 12.36). Its legacy
equivalent would be <NA, VIRAMA, ZWJ, VOWEL SIGN E>. It might be OK to
disallow this; but we should be mindful of this side effect.

​To account for the legacy cases, the simplest approach might be to add
some characters to GCB=LinkingConsonant.

Note: The final date for deciding exactly what to do with #29 will be in
April, so there is some more time to discuss this. But we have to have a
pretty solid proposal going into that April meeting. The only test files
that we have gotten from India so far include Devanagari, Malayalam and
Bengali. I suspect that the UTC is likely to be conservative, and limit the
GCB=Virama category to just those scripts that we have test files for, and
that look complete.




Mark

On Mon, Dec 11, 2017 at 2:16 AM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sun, 10 Dec 2017 21:14:18 -0800
> Manish Goregaokar via Unicode <unicode@unicode.org> wrote:
>
> > > GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant
> >
> > You can also explicitly request ligatureification with a ZWJ, so
> > perhaps this rule should be something like
> >
> > (Virama ZWJ? | ZWJ) x Extend* LinkingConsonant
> >
> > -Manish
> >
> > On Sat, Dec 9, 2017 at 7:16 AM, Mark Davis ☕️ via Unicode <
> > unicode@unicode.org> wrote:
> >
> > > 1. You make a good point about the GB9c. It should probably instead
> > > be something like:
> > >
> > > GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant
>
> This change is unnecessary.  If we start from Draft 1 where there are:
>
> GB9:    ×   (Extend | ZWJ | Virama)
> GB9c:   (Virama | ZWJ ) ×   LinkingConsonant
>
> If the classes used in the rules are to be disjoint, we then have to
> split Extend into something like ViramaExtend and OtherExtend to allow
> normalised (NFC/NFD) text, at which point we may as well continue to
> have rules that work without any normalisation. Informally,
>
> ViramaExtend = Extend and ccc ≠ 0.
>
> OtherExtend = Extend and ccc = 0.
>
> (We might need to put additional characters in ViramaExtend.)
>
> This gives us rules:
>
> GB9': × (OtherExtend | ViramaExtend | ZWJ | Virama)
>
> GB9c':  (Virama | ZWJ ) ViramaExtend* × LinkingConsonant
>
> So, for a sequence <virama, ZWJ, nukta, LinkingConsonant>, GB9' gives us
>
> virama × ZWJ × nukta LinkingConsonant
>
> and GB9c' gives us
>
> virama × ZWJ × nukta × LinkingConsonant
>
> ---
> In Rule GB9c, what examples justify including ZWJ?  Are they just the C1
> half-forms?  My knowledge suggests that
>
> GB9c'': Virama (ZWJ | ViramaExtend)* × LinkingConsonant
>
> might be more appropriate.
>
> Richard.
>
>
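Richard's GB9'/GB9c' formulation can be sanity-checked mechanically. The
sketch below is a toy model (the class names are symbolic stand-ins, not real
Unicode property data, and only GB9', GB9c', and the default break rule are
modelled); it confirms that <virama, ZWJ, nukta, LinkingConsonant> contains
no internal boundary:

```python
# Toy model of GB9' and GB9c' over symbolic, disjoint property classes.
def no_break(left, right):
    """left: classes before the candidate position; right: class after it."""
    # GB9': x (OtherExtend | ViramaExtend | ZWJ | Virama)
    if right in ("OtherExtend", "ViramaExtend", "ZWJ", "Virama"):
        return True
    # GB9c': (Virama | ZWJ) ViramaExtend* x LinkingConsonant
    if right == "LinkingConsonant":
        i = len(left) - 1
        while i >= 0 and left[i] == "ViramaExtend":
            i -= 1
        return i >= 0 and left[i] in ("Virama", "ZWJ")
    return False  # GB999: otherwise, break

# <virama, ZWJ, nukta (here classed as ViramaExtend), LinkingConsonant>
seq = ["Virama", "ZWJ", "ViramaExtend", "LinkingConsonant"]
breaks = [i for i in range(1, len(seq)) if not no_break(seq[:i], seq[i])]
assert breaks == []  # virama x ZWJ x nukta x LinkingConsonant
```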


Re: Non-RGI sequences are not emoji? (was: Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10))

2018-01-05 Thread Mark Davis ☕️ via Unicode
Doug, I modified my working draft, at
https://docs.google.com/document/d/1EuNjbs0XrBwqlvCJxra44o3EVrwdBJUWsPf8Ec1fWKY

If that looks ok, I'll submit.

Thanks again for your comments.

Mark


On Wed, Jan 3, 2018 at 9:29 AM, Mark Davis ☕️ <m...@macchiato.com> wrote:

> Thanks for your comments; you raise an excellent issue. There are valid
> sequences that are not RGI; a vendor can support additional emoji sequences
> (in particular, flags). So the wording in the doc isn't correct.
>
> It should do something like replace the use of "testing for RGI" by
> "testing for validity". The key areas involved in that are checking for the
> valid base+modifier combinations, valid RI pairs, and TAG sequences. The
> latter two involve testing based on the information applied in the
> appendix, while the valid base+modifiers are more regular and can be tested
> based on properties.
>
>
> Mark
>
> On Tue, Jan 2, 2018 at 9:55 PM, Doug Ewell via Unicode <
> unicode@unicode.org> wrote:
>
>> Mark Davis wrote:
>>
>> BTW, relevant to this discussion is a proposal filed
>>> http://www.unicode.org/L2/L2017/17434-emoji-rejex-uts51-def.pdf (The
>>> date is wrong, should be 2017-12-22)
>>>
>>
>> The phrase "emoji regex" had caused me to ignore this document, but I
>> took a look based on this thread. It says "we still depend on the RGI test
>> to filter the set of emoji sequences" and proposes that the EBNF in UTS #51
>> be simplified on the basis that only RGI sequences will pass the "possible
>> emoji" test anyway.
>>
>> Thus it is true, as some people have said (i.e. in L2/17‐382), that
>> non-RGI sequences do not actually count as emoji, and therefore there is no
>> way — not merely no "recommended" way — to represent the flags of entities
>> such as Catalonia and Brittany.
>>
>> In 2016 I had asked for the emoji tag sequence mechanism for flags to be
>> available for all CLDR subdivisions, not just three, with the understanding
>> that the vast majority would not be supported by vendor glyphs. II t is
>> unfortunate that, while the conciliatory name "recommended" was adopted for
>> the three, the intent of "exclusively permitted" was retained.
>>
>> --
>> Doug Ewell | Thornton, CO, US | ewellic.org
>>
>>
>


Re: Regex for Grapheme Cluster Breaks

2018-01-03 Thread Mark Davis ☕️ via Unicode
Quick update: Manish pointed out that I'd misstated one of the rules; it
should be:

skin-sequence = $E_Base $Extend* $E_Modifier ;

​With that change, the test passes. (Thanks Manish!)​

Mark

On Wed, Jan 3, 2018 at 10:16 AM, Mark Davis ☕️ <m...@macchiato.com> wrote:

> I had a UTC action to adjust http://www.unicode.org/
> reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_
> Clusters to update the regex, and other necessary changes surrounding
> text.
>
> Here is what I've come up with for an EBNF formulation. The $x are the GCB
> properties.
>
> cluster = crlf | $Control | precore* core postcore* ;
>
>
> crlf = $CR $LF ;
>
>
> precore =  $Prepend ;
>
>
> postcore = (?: virama-sequence | [$Extend $ZWJ $Virama $SpacingMark] );
>
>
> core = (?: hangul-syllable | ri-sequence | xpicto-sequence | virama-sequence
> | [^$Control $CR $LF] );
>
>
> hangul-syllable = $L* (?:$V+ | $LV $V* | $LVT) $T* | $L+ | $T+ ;
>
>
> ri-sequence = $RI $RI ;
>
>
>
> skin-sequence = $E_Base $E_Modifier ;
>
>
> xpicto-sequence = (?: skin-sequence | \p{Extended_Pictographic} ) (?:
> $Extend* $ZWJ (?: skin-sequence | \p{Extended_Pictographic} ))* ;
>
>
> virama-sequence = [$Virama $ZWJ] $LinkingConsonant ;
>
>
> ​I have tools to turn that into a (lovely) regex:
>
\p{gcb=cr}\p{gcb=lf}|\p{gcb=control}|\p{gcb=Prepend}*(?:
\p{gcb=l}*(?:\p{gcb=v}+|\p{gcb=lv}\p{gcb=v}*|\p{gcb=lvt})\p{gcb=t}*
|\p{gcb=l}+|\p{gcb=t}+|\p{gcb=ri}\p{gcb=ri}
|(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic})
(?:\p{gcb=Extend}*\p{gcb=zwj}(?:\p{gcb=e_base}\p{gcb=E_Modifier}|\p{Extended_Pictographic}))*
|[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}
|[^\p{gcb=control}\p{gcb=cr}\p{gcb=lf}]
)(?:[\p{gcb=Virama}\p{gcb=zwj}]\p{gcb=LinkingConsonant}
|[\p{gcb=Extend}\p{gcb=zwj}\p{gcb=Virama}\p{gcb=SpacingMark}])*
> ​(It is a bit shorter if some more property names/values are abbreviated.)
>
> I then tested against the current test file: GraphemeBreakTest.txt. There
> is one outlying failure with that test file:
>
> 813) ☝̈
>
> hex: 261D 0308 1F3FB
>
> test: [0, 4]
>
> ebnf: [0, 2, 4]
>
> I believe that is a problem with the test rather than the BNF, but I need
> to track it down in any event.
>
> ​A regex is much easier for many applications to use than the current rule
> syntax, so I'm going to see if the other segmentations could be
> reformulated as EBNFs (ideally corresponding to regular grammars, or in
> the worst case, as PEGs).
>
> Feedback is welcome.
>
> ​
> Mark
>
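As a rough illustration of the regex approach, here is a hypothetical,
self-contained sketch using only stdlib `re` (not a \p{gcb=...}-aware
engine). It covers only a small fragment of the rules: the RI pair and
combining diacritics standing in for GCB=Extend; everything else is treated
as a plain base character:

```python
import re

# Regional_Indicator range; U+0300..U+036F stand in for GCB=Extend.
RI = "[\U0001F1E6-\U0001F1FF]"
CLUSTER = re.compile(
    rf"{RI}{RI}"            # ri-sequence: exactly a pair (GB12/GB13)
    r"|\r\n"                # crlf
    r"|.[\u0300-\u036F]*",  # base + Extend stand-ins (GB9)
    re.DOTALL,
)

def clusters(text):
    """Split text into grapheme-cluster-like matches (simplified)."""
    return CLUSTER.findall(text)

# Four RI characters (two flags) segment into two pair clusters,
# matching the pair-wise RI rule rather than a greedy RI+ run.
flags = "\U0001F1EC\U0001F1E7\U0001F1EA\U0001F1FA"
assert len(clusters(flags)) == 2
assert clusters("e\u0301x") == ["e\u0301", "x"]
```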


Re: Non-RGI sequences are not emoji? (was: Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10))

2018-01-03 Thread Mark Davis ☕️ via Unicode
Thanks for your comments; you raise an excellent issue. There are valid
sequences that are not RGI; a vendor can support additional emoji sequences
(in particular, flags). So the wording in the doc isn't correct.

It should do something like replace the use of "testing for RGI" by
"testing for validity". The key areas involved in that are checking for the
valid base+modifier combinations, valid RI pairs, and TAG sequences. The
latter two involve testing based on the information applied in the
appendix, while the valid base+modifiers are more regular and can be tested
based on properties.


Mark

On Tue, Jan 2, 2018 at 9:55 PM, Doug Ewell via Unicode <unicode@unicode.org>
wrote:

> Mark Davis wrote:
>
> BTW, relevant to this discussion is a proposal filed
>> http://www.unicode.org/L2/L2017/17434-emoji-rejex-uts51-def.pdf (The
>> date is wrong, should be 2017-12-22)
>>
>
> The phrase "emoji regex" had caused me to ignore this document, but I took
> a look based on this thread. It says "we still depend on the RGI test to
> filter the set of emoji sequences" and proposes that the EBNF in UTS #51 be
> simplified on the basis that only RGI sequences will pass the "possible
> emoji" test anyway.
>
> Thus it is true, as some people have said (i.e. in L2/17‐382), that
> non-RGI sequences do not actually count as emoji, and therefore there is no
> way — not merely no "recommended" way — to represent the flags of entities
> such as Catalonia and Brittany.
>
> In 2016 I had asked for the emoji tag sequence mechanism for flags to be
> available for all CLDR subdivisions, not just three, with the understanding
> that the vast majority would not be supported by vendor glyphs. II t is
> unfortunate that, while the conciliatory name "recommended" was adopted for
> the three, the intent of "exclusively permitted" was retained.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)

2018-01-02 Thread Mark Davis ☕️ via Unicode
BTW, relevant to this discussion is a proposal filed http://www.unicode.org/
L2/L2017/17434-emoji-rejex-uts51-def.pdf (The date is wrong, should
be 2017-12-22)

Mark

On Tue, Jan 2, 2018 at 11:41 AM, Mark Davis ☕️ <m...@macchiato.com> wrote:

> We had that originally, but some people objected that some languages
> (Arabic, as I recall) can end a string of letters with a ZWJ, and
> immediately follow it by an emoji (without an intervening space) without
> wanting it to be joined into a grapheme cluster with a following symbol.
> While I personally consider that a degenerate case, we tightened the
> definition to prevent that.
>
> Mark
>
> Mark
>
> On Tue, Jan 2, 2018 at 10:41 AM, Manish Goregaokar <man...@mozilla.com>
> wrote:
>
>> In the current draft GB11 mentions Extended_Pictographic Extend* ZWJ x
>> Extended_Pictographic.
>>
>> Can this similarly be distilled to just ZWJ x Extended_Pictographic? This
>> does affect cases like  or  letter, zwj, emoji> and I'm not certain if that counts as a degenerate
>> case. If we do this then all of the rules except the flag emoji one become
>> things which can be easily calculated with local information, which is nice
>> for implementors.
>>
>> (Also in the current draft I think GB11 needs a `E_Modifier?` somewhere
>> but if we merge that with Extend that's not going to be necessary anyway)
>>
>> -Manish
>>
>> On Tue, Jan 2, 2018 at 3:02 PM, Manish Goregaokar <man...@mozilla.com>
>> wrote:
>>
>>> > Note: we are already planning to get rid of the GAZ/EBG distinction (
>>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event.
>>>
>>>
>>> This is great! I hadn't noticed this when I last saw that draft (I was
>>> focusing on the Virama stuff). Good to know!
>>>
>>>
>>> > Instead, we'd add one line to
>>> *Extend <http://www.unicode.org/reports/tr29/tr29-32.html#Extend>:*
>>>
>>> Yeah, this is essentially what I was hoping we could do.
>>>
>>> Is there any way to formally propose this? Or is bringing it up here
>>> good enough?
>>>
>>> Thanks,
>>>
>>> -Manish
>>>
>>> On Mon, Jan 1, 2018 at 9:17 PM, Mark Davis ☕️ via Unicode <
>>> unicode@unicode.org> wrote:
>>>
>>>> This is an interesting suggestion, Manish.
>>>>
>>>> <non-emoji-base, skin tone modifier> is a degenerate case, so if we
>>>> follow your suggestion we could also drop E_Base and E_Modifier, and
>>>> rule GB10.
>>>>
>>>> Instead, we'd add one line to *Extend
>>>> <http://www.unicode.org/reports/tr29/tr29-32.html#Extend>:*
>>>>
>>>> OLD
>>>> Grapheme_Extend = Yes
>>>> *and not* GCB = Virama
>>>>
>>>> NEW
>>>> Grapheme_Extend = Yes, or
>>>> Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [
>>>> UTS51 <http://www.unicode.org/reports/tr41/tr41-21.html#UTS51>].
>>>> *and not* GCB = Virama
>>>>
>>>> Note: we are already planning to get rid of the GAZ/EBG distinction (
>>>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event.
>>>>
>>>> Mark
>>>>
>>>> On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode <
>>>> unicode@unicode.org> wrote:
>>>>
>>>>> On Mon, 1 Jan 2018 13:24:29 +0530
>>>>> Manish Goregaokar via Unicode <unicode@unicode.org> wrote:
>>>>>
>>>>> >  sounds very much like a
>>>>> > degenerate case to me.
>>>>>
>>>>> Generally yes, but I'm not sure that they'd be inappropriate for
>>>>> Egyptian hieroglyphs showing human beings.  The choice of determinative
>>>>> can convey unpronounceable semantic information, though I'm not sure
>>>>> that that can be as sensitive as skin colour.  However, in such a case
>>>>> it would also be appropriate to give a skin tone modifier the property
>>>>> Extend.
>>>>>
>>>>> Richard.
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)

2018-01-02 Thread Mark Davis ☕️ via Unicode
We had that originally, but some people objected that some languages
(Arabic, as I recall) can end a string of letters with a ZWJ, and
immediately follow it by an emoji (without an intervening space) without
wanting it to be joined into a grapheme cluster with a following symbol.
While I personally consider that a degenerate case, we tightened the
definition to prevent that.

Mark


On Tue, Jan 2, 2018 at 10:41 AM, Manish Goregaokar <man...@mozilla.com>
wrote:

> In the current draft GB11 mentions Extended_Pictographic Extend* ZWJ x
> Extended_Pictographic.
>
> Can this similarly be distilled to just ZWJ x Extended_Pictographic? This
> does affect cases like  or  letter, zwj, emoji> and I'm not certain if that counts as a degenerate
> case. If we do this then all of the rules except the flag emoji one become
> things which can be easily calculated with local information, which is nice
> for implementors.
>
> (Also in the current draft I think GB11 needs a `E_Modifier?` somewhere
> but if we merge that with Extend that's not going to be necessary anyway)
>
> -Manish
>
> On Tue, Jan 2, 2018 at 3:02 PM, Manish Goregaokar <man...@mozilla.com>
> wrote:
>
>> > Note: we are already planning to get rid of the GAZ/EBG distinction (
>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event.
>>
>>
>> This is great! I hadn't noticed this when I last saw that draft (I was
>> focusing on the Virama stuff). Good to know!
>>
>>
>> > Instead, we'd add one line to
>> *Extend <http://www.unicode.org/reports/tr29/tr29-32.html#Extend>:*
>>
>> Yeah, this is essentially what I was hoping we could do.
>>
>> Is there any way to formally propose this? Or is bringing it up here good
>> enough?
>>
>> Thanks,
>>
>> -Manish
>>
>> On Mon, Jan 1, 2018 at 9:17 PM, Mark Davis ☕️ via Unicode <
>> unicode@unicode.org> wrote:
>>
>>> This is an interesting suggestion, Manish.
>>>
>>> <non-emoji-base, skin tone modifier> is a degenerate case, so if we
>>> follow your suggestion we could also drop E_Base and E_Modifier, and
>>> rule GB10.
>>>
>>> Instead, we'd add one line to *Extend
>>> <http://www.unicode.org/reports/tr29/tr29-32.html#Extend>:*
>>>
>>> OLD
>>> Grapheme_Extend = Yes
>>> *and not* GCB = Virama
>>>
>>> NEW
>>> Grapheme_Extend = Yes, or
>>> Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [
>>> UTS51 <http://www.unicode.org/reports/tr41/tr41-21.html#UTS51>].
>>> *and not* GCB = Virama
>>>
>>> Note: we are already planning to get rid of the GAZ/EBG distinction (
>>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event.
>>>
>>> Mark
>>>
>>> On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode <
>>> unicode@unicode.org> wrote:
>>>
>>>> On Mon, 1 Jan 2018 13:24:29 +0530
>>>> Manish Goregaokar via Unicode <unicode@unicode.org> wrote:
>>>>
>>>> >  sounds very much like a
>>>> > degenerate case to me.
>>>>
>>>> Generally yes, but I'm not sure that they'd be inappropriate for
>>>> Egyptian hieroglyphs showing human beings.  The choice of determinative
>>>> can convey unpronounceable semantic information, though I'm not sure
>>>> that that can be as sensitive as skin colour.  However, in such a case
>>>> it would also be appropriate to give a skin tone modifier the property
>>>> Extend.
>>>>
>>>> Richard.
>>>>
>>>
>>>
>>
>


Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)

2018-01-02 Thread Mark Davis ☕️ via Unicode
> Or is bringing it up here good enough?

You should submit a proposal, which you can do at
https://www.unicode.org/reporting.html. It doesn't have to be much more
than what you put in email.

(A reminder for everyone here: This is simply a discussion list, and has no
effect whatsoever unless someone submits a proposal for the UTC.)

Mark

On Tue, Jan 2, 2018 at 10:32 AM, Manish Goregaokar <man...@mozilla.com>
wrote:

> > Note: we are already planning to get rid of the GAZ/EBG distinction (
> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event.
>
>
> This is great! I hadn't noticed this when I last saw that draft (I was
> focusing on the Virama stuff). Good to know!
>
>
> > Instead, we'd add one line to
> *Extend <http://www.unicode.org/reports/tr29/tr29-32.html#Extend>:*
>
> Yeah, this is essentially what I was hoping we could do.
>
> Is there any way to formally propose this? Or is bringing it up here good
> enough?
>
> Thanks,
>
> -Manish
>
> On Mon, Jan 1, 2018 at 9:17 PM, Mark Davis ☕️ via Unicode <
> unicode@unicode.org> wrote:
>
>> This is an interesting suggestion, Manish.
>>
>> <non-emoji-base, skin tone modifier> is a degenerate case, so if we
>> follow your suggestion we could also drop E_Base and E_Modifier, and
>> rule GB10.
>>
>> Instead, we'd add one line to *Extend
>> <http://www.unicode.org/reports/tr29/tr29-32.html#Extend>:*
>>
>> OLD
>> Grapheme_Extend = Yes
>> *and not* GCB = Virama
>>
>> NEW
>> Grapheme_Extend = Yes, or
>> Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [
>> UTS51 <http://www.unicode.org/reports/tr41/tr41-21.html#UTS51>].
>> *and not* GCB = Virama
>>
>> Note: we are already planning to get rid of the GAZ/EBG distinction (
>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event.
>>
>> Mark
>>
>> On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode <
>> unicode@unicode.org> wrote:
>>
>>> On Mon, 1 Jan 2018 13:24:29 +0530
>>> Manish Goregaokar via Unicode <unicode@unicode.org> wrote:
>>>
>>> >  sounds very much like a
>>> > degenerate case to me.
>>>
>>> Generally yes, but I'm not sure that they'd be inappropriate for
>>> Egyptian hieroglyphs showing human beings.  The choice of determinative
>>> can convey unpronounceable semantic information, though I'm not sure
>>> that that can be as sensitive as skin colour.  However, in such a case
>>> it would also be appropriate to give a skin tone modifier the property
>>> Extend.
>>>
>>> Richard.
>>>
>>
>>
>


Re: Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)

2018-01-01 Thread Mark Davis ☕️ via Unicode
This is an interesting suggestion, Manish.

<non-emoji-base, skin tone modifier> is a degenerate case, so if we
follow your suggestion we could also drop E_Base and E_Modifier, and
rule GB10.

Instead, we'd add one line to *Extend
<http://www.unicode.org/reports/tr29/tr29-32.html#Extend>:*

OLD
Grapheme_Extend = Yes
*and not* GCB = Virama

NEW
Grapheme_Extend = Yes, or
Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [UTS51
<http://www.unicode.org/reports/tr41/tr41-21.html#UTS51>].
*and not* GCB = Virama

Note: we are already planning to get rid of the GAZ/EBG distinction (
http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event.

Mark

On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Mon, 1 Jan 2018 13:24:29 +0530
> Manish Goregaokar via Unicode  wrote:
>
> >  sounds very much like a
> > degenerate case to me.
>
> Generally yes, but I'm not sure that they'd be inappropriate for
> Egyptian hieroglyphs showing human beings.  The choice of determinative
> can convey unpronounceable semantic information, though I'm not sure
> that that can be as sensitive as skin colour.  However, in such a case
> it would also be appropriate to give a skin tone modifier the property
> Extend.
>
> Richard.
>
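The unification discussed in this thread — folding Emoji_Modifier into
Extend so that E_Base, E_Modifier, and GB10 can be dropped — can be sketched
in a few lines. This is a hypothetical illustration only: U+1F3FB..U+1F3FF
stands in for Emoji_Modifier and U+0300..U+036F for Grapheme_Extend; a real
implementation would derive these sets from the UCD:

```python
def is_extend(cp):
    # NEW definition sketched above: Grapheme_Extend=Yes, or Emoji_Modifier
    return 0x0300 <= cp <= 0x036F or 0x1F3FB <= cp <= 0x1F3FF

def clusters(text):
    """Cluster by the single rule: never break before Extend (GB9)."""
    out = []
    for ch in text:
        if out and is_extend(ord(ch)):
            out[-1] += ch  # attach to the previous cluster; no GB10 needed
        else:
            out.append(ch)
    return out

# U+1F44D THUMBS UP + U+1F3FF dark skin tone stays a single cluster
assert clusters("\U0001F44D\U0001F3FF!") == ["\U0001F44D\U0001F3FF", "!"]
```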


Re: Possible bug in formal grammar for extended grapheme cluster

2017-12-18 Thread Mark Davis ☕️ via Unicode
If you look back at http://www.unicode.org/reports/tr29/tr29-27.html#GB8a
(2015), the rule was simply not to break sequences of RI characters.

We changed that in http://www.unicode.org/reports/tr29/tr29-29.html#GB12
(2016) to only group pairs. Unfortunately, the (informative) table
http://www.unicode.org/reports/tr29/tr29-31.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters
was not updated after 2015 to keep pace with the changes in rules. So that
is still to do.



Mark <https://twitter.com/mark_e_davis>

On Mon, Dec 18, 2017 at 10:59 AM, Andre Schappo via Unicode <
unicode@unicode.org> wrote:

> Ah! That explains why
>
> pcre2grep -u '^\X{1}$'
>
> matches with
>
> 
> 
> 
> 
>
> ...etc...
>
> André Schappo
>
> On 17 Dec 2017, at 17:17, Mark Davis ☕️ via Unicode <unicode@unicode.org>
> wrote:
>
> Thanks for the feedback. You're correct about this; that is a holdover
> from an earlier version of the document when there was a more basic
> treatment of RI sequences.
>
> There is already an action to modify these. There is a placeholder review
> note about that just above
>
> http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_
> Sequences_and_Grapheme_Clusters
>
> (scroll up just a bit).
>
> Mark
>
> Mark <https://twitter.com/mark_e_davis>
>
> On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode <
> unicode@unicode.org> wrote:
>
>> Hi,
>>
>> It’s possible I’m missing something, but the formal grammar/regular
>> expression given for extended grapheme clusters appears to have a bug
>> in it.
>> <https://unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters>
>>
>> The bug is here:
>>
>> RI-Sequence := Regional_Indicator+
>>
>> If the formal grammar is intended to exactly match the rules given in
>> the “Grapheme Cluster Boundary Rules” section below it as-is, then
>> this should be
>>
>> RI-Sequence := Regional_Indicator Regional_Indicator
>>
>> since as given it would cause any number of RI characters to coalesce
>> into a single grapheme cluster, instead of pairs of characters. That
>> is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one
>> grapheme cluster instead of the correct two.
>>
>> --
>> dpk (David P. Kendal) · Nassauische Str. 36, 10717 DE · http://dpk.io/
>>we do these things not because they are easy,  +49 159 03847809
>>   but because we thought they were going to be easy
>>   — ‘The Programmers’ Credo’, Maciej Cegłowski
>>
>>
>>
>
>   
> André Schappo
> https://schappo.blogspot.co.uk
> https://twitter.com/andreschappo
> https://weibo.com/andreschappo
> https://groups.google.com/forum/#!forum/computer-science-curriculum-internationalization
>
>
>
>
>
>


Re: Possible bug in formal grammar for extended grapheme cluster

2017-12-17 Thread Mark Davis ☕️ via Unicode
Thanks for the feedback. You're correct about this; that is a holdover from
an earlier version of the document when there was a more basic treatment of
RI sequences.

There is already an action to modify these. There is a placeholder review
note about that just above

http://www.unicode.org/reports/tr29/proposed.html#Table_Combining_Char_Sequences_and_Grapheme_Clusters

(scroll up just a bit).

Mark


On Sun, Dec 17, 2017 at 4:16 PM, David P. Kendal via Unicode <
unicode@unicode.org> wrote:

> Hi,
>
> It’s possible I’m missing something, but the formal grammar/regular
> expression given for extended grapheme clusters appears to have a bug
> in it.
> <https://unicode.org/reports/tr29/#Table_Combining_Char_Sequences_and_Grapheme_Clusters>
>
> The bug is here:
>
> RI-Sequence := Regional_Indicator+
>
> If the formal grammar is intended to exactly match the rules given in
> the “Grapheme Cluster Boundary Rules” section below it as-is, then
> this should be
>
> RI-Sequence := Regional_Indicator Regional_Indicator
>
> since as given it would cause any number of RI characters to coalesce
> into a single grapheme cluster, instead of pairs of characters. That
> is, the text U+1F1EC U+1F1E7 U+1F1EA U+1F1FA would represent one
> grapheme cluster instead of the correct two.
>
> --
> dpk (David P. Kendal) · Nassauische Str. 36, 10717 DE · http://dpk.io/
>we do these things not because they are easy,  +49 159 03847809
>   but because we thought they were going to be easy
>   — ‘The Programmers’ Credo’, Maciej Cegłowski
>
>
>


Re: Word_Break for Hieroglyphs

2017-12-14 Thread Mark Davis ☕️ via Unicode
Mark <https://twitter.com/mark_e_davis>

On Thu, Dec 14, 2017 at 3:22 PM, Michael Everson <ever...@evertype.com>
wrote:

> On 14 Dec 2017, at 14:14, Mark Davis ☕️ via Unicode <unicode@unicode.org>
> wrote:
>
> > The Word_Break property doesn't have a value Complex_Context, but I
> think that was just a typo in your message.
> >
> > The word break and line break properties for 1,057 [:Script=Egyp:]
> characters are currently
> >
> > Word_Break=ALetter
> > Line_Break=Alphabetic
> >
> > Off the top of my head, I think the best course would be to make them
> both the same as for most of [:Script=Hani:]
> >
> > Word_Break=Other
> > Line_Break=Ideographic
>
> Egyptian is not ideographic and is certainly not fixed-width. CJK does not
> cluster. Why should you want to make them the same?


Fixed-width has *nothing* to do with these properties. The issue is
whether spaces are required between words. The impact of these properties
with their current values is that

   - you would never break a word within a string of hieroglyphs (eg on
   double-click), and
   - you would only break a line within a string of hieroglyphs if there
   are no spaces, etc. on the line.

For example, if you have a string of 300 hieroglyphs in a paragraph, double
clicking on one of them would select the entire string, because as far as
Word_Break is concerned, the entire 300 characters form one word. For
linebreak, you would only break when forced. So in a paragraph of passages
of English + hieroglyphs (represented here by CAPS), you would only break
at the spaces and when forced. For example, suppose we have:

... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEWLQNFNNAKDFNFNQKLER
is constructed from 15 words with...

It would not line break (with the current properties) as:

... the passage ABCJKQELRKLQNEKLAFNKLAEFNKLAREN
KQLNRKEWLQNFNNAKDFNFNQKLER is constructed from
15 words with...

but rather as:

... the passage
ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEW
LQNFNNAKDFNFNQKLER is constructed from 15 words with...
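The behavior described above can be sketched with a toy greedy breaker, assuming break opportunities exist only at spaces, so a long run without spaces breaks only "when forced" (when it alone is longer than the line):

```python
def wrap(text, width):
    """Toy greedy breaker: break at spaces; break inside a 'word' only
    when the word alone is longer than the line ('when forced')."""
    lines, line = [], ''
    for word in text.split():
        if len(line) + bool(line) + len(word) <= width:
            line += (' ' if line else '') + word
        elif len(word) <= width:
            lines.append(line)
            line = word
        else:  # a run longer than the whole line: forced breaks
            if line:
                lines.append(line)
            while len(word) > width:
                lines.append(word[:width])
                word = word[width:]
            line = word
    if line:
        lines.append(line)
    return lines

run = 'ABCJKQELRKLQNEKLAFNKLAEFNKLARENKQLNRKEWLQNFNNAKDFNFNQKLER'
lines = wrap('the passage ' + run + ' is constructed from 15 words', 40)
# Breaks fall at spaces; the hieroglyph-like run breaks only when forced.
```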



> Moreover, these properties were defined at the beginning, were they not?
> Bob Richmond and others will certainly have a view on this.
>

If there is defined clustering behavior that affects line break, then the
line break property value would need to be Complex_Context.

But the *current* value is Alphabetic, which makes any length of
hieroglyphs function as one (possibly very long) word. That appears clearly
wrong, even if it was "defined at the beginning". Properties are not carved
in stone (so to speak); we sometimes find out later, especially for
seldom-used scripts, that property values can be improved.


> > We would only need to use Complex_Context [:lb=SA:] for scripts that
> keep some letters together and break others apart (typically needing
> dictionary lookup). I would suspect for modern use of Egyp, that is not the
> case;
>
> Please do not “suspect”. It is not hard to ask experts.
>

​You misunderstand. When I say "I suspect" that means I'm not certain. Thus
I would like people who are both knowledgeable about hieroglyphs *and*
Unicode properties to weigh in. I know that people like Andrew Glass are on
this list, who satisfy both criteria.
​

>
> > most people would expect the characters to would just flow like
> ideographs, breaking between any pair:
>
> NO. Clusters cannot be broken up just anywhere.
>

A simple assertion without more information is useless.

Does that mean that ancient inscriptions would leave gaps at the end of
lines in order to not break a cluster, or that modern users would expect
software to leave gaps at the end of lines in order ​to not break a
cluster? And what constitutes a cluster? Is that semantically determined
(eg like Thai), or is it based on algorithmic features of the hieroglyphs?


> > you wouldn't need to disallow breaks between a  with an axe> and a , for example.
> >
> > Also, I noticed that the 14 Egyp characters with Line_Break≠Alphabetic
> have a linebreak and general category properties that seem odd and
> inconsistent to me.
> >
> > Line_Break=Close_Punctuation
> > General_Category=Other_Letter (items: 8)
> Egyptian Hieroglyphs — O. Buildings, parts of buildings, etc. (items: 6)
> >
> >  ㉛   U+1325B EGYPTIAN HIEROGLYPH O006D
> >  ㉜   U+1325C EGYPTIAN HIEROGLYPH O006E
> >  ㉝   U+1325D EGYPTIAN HIEROGLYPH O006F
> >  ㊂   U+13282 EGYPTIAN HIEROGLYPH O033A
> >  ㊇   U+13287 EGYPTIAN HIEROGLYPH O036B
> >  ㊉   U+13289 EGYPTIAN HIEROGLYPH O036D
> > Egyptian Hieroglyphs — V. Rope, fiber, baskets, bags, etc. (items: 2)
> >
> >  ㍺   U+1337A EGYPTIAN HIEROGLYPH V011B
> >  ㍻   U+1337B EGYPTIAN HIEROGLYPH V011C
> > Line_Break=Open_Punctuation
> > General_Category=Other_Letter

Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2017-12-11 Thread Mark Davis ☕️ via Unicode
The proposed rules do not distinguish the different visual forms that a
sequence of characters surrounding a virama can have, such as

   1. an explicit virama, or
   2. a half-form is visible, or
   3. a ligature is created.

That is following the requested structure in
http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf.

So with these rules a ZWNJ (see Figure 12-3. Preventing Conjunct Forms in
Devanagari <http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G14632>)
doesn't
break a GC, nor do instances where a particular script always shows an
explicit virama between two particular consonants. All the lines on Figure
12-7. Consonant Forms in Devanagari and Oriya
<http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#G59257> having a
virama would have single GCs (that is, all but the first line). [That,
after correcting the rules as per Manish Goregaokar's feedback, thanks!]

The examples in "Annexure B" of 17200-text-seg-rec.pdf
<http://www.unicode.org/L2/L2017/17200-text-seg-rec.pdf> clearly include #2
and #3, but don't have any examples of #1 (as far as I can tell from a
quick scan). It would be very useful to have explicit examples that
included #1, and included scripts other than Devanagari (+swaran,
others). While
the online tool at http://unicode.org/cldr/utility/breaks.jsp can't yet be
used until the Unicode 11 UCD is further along, I have an implementation of
the new rules such that I can take any particular list of words and
generate the breaks. So if someone can supply examples from different
scripts or with different combinations of virama, zwj, zwnj, etc. I can
push out the result to this list.

And yes, we do need review of these for Malayalam (+cibu, others).

If there are scripts for which the rules really don't work (or need more
research before #29 is finalized in May), it is fairly straightforward to
restrict the rule changes by modifying
http://www.unicode.org/reports/tr29/proposed.html#Virama to either exclude
particular scripts or include only particular scripts.

Mark <https://twitter.com/mark_e_davis>

On Sat, Dec 9, 2017 at 9:30 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Sat, 9 Dec 2017 16:16:44 +0100
> Mark Davis ☕️ via Unicode <unicode@unicode.org> wrote:
>
> > 1. You make a good point about the GB9c. It should probably instead be
> > something like:
> >
> > GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant
> >
> >
> > Extend is broader than necessary, and there are a few items that
> > have ccc!=0 but not gcb=extend. But all of those look to be
> > degenerate cases.
>
> Something *like*.
>
> Gcb=Extend includes ZWNJ and U+0D02 MALAYALAM SIGN ANUSVARA.  I believe
> these both prevent a preceding candrakkala from extending an akshara -
> see TUS Section 12.9 about Table 12-33.  I think Extend will have to be
> split between starters and non-starters.
>
> I believe there is a problem with the first two examples in Table
> 12-33.  If one suffixed <U+0D15 MALAYALAM LETTER KA, U+0D3E MALAYALAM
> VOWEL SIGN AA> to the first two examples, yielding *പാലു്കാ and
> *എ്ന്നാകാ, one would have three Malayalam aksharas, not two extended
> grapheme clusters as the proposed rules would say. This is different to
> Tai Tham, where there would indeed just be two aksharas in each word,
> albeit odd-looking - ᨷᩤᩃᩩ᩠ᨠᩣ and ᩑ᩠ᨶ᩠ᨶᩣᨠᩣ.  Who's checking the impact of
> these changes on Malayalam?
>
> Richard.
>
>


Re: Aquaφοβία

2017-12-09 Thread Mark Davis ☕️ via Unicode
Some people have been confused by the previous wording, and thought that it
wouldn't be legitimate to break on script boundaries. So we wanted to make
it clear that that was possible, since:

   1. Many implementations of rendering break text into script-runs before
   further processing, and
   2. There are certainly cases where users' expectations are better met
   with breaks on script boundaries*

We thus wanted to make it clear to people that it *is* a legitimate
customization to break on script boundaries.

* Clearly such an approach can't be hard-nosed: an implementation would
need at the very least to handle Common and Inherited specially: not impose
a boundary *because of script* where the SCX value is one of those, either
before or after a break point.
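The script-run customization described above can be sketched roughly. This is a toy, not a conformant implementation: Python's stdlib does not expose the Script property, so the sketch approximates it from the first word of the character name, and non-letters stand in for Common/Inherited, which inherit the surrounding run rather than forcing a boundary:

```python
import unicodedata

def toy_script(ch):
    """Toy stand-in for the Script property: first word of the character
    name for letters; None (treated like Common/Inherited) otherwise."""
    if not ch.isalpha():
        return None
    return unicodedata.name(ch).split()[0]   # e.g. 'LATIN', 'GREEK'

def script_runs(text):
    runs, cur, cur_script = [], '', None
    for ch in text:
        s = toy_script(ch)
        if s is None or cur_script is None or s == cur_script:
            cur += ch
            cur_script = cur_script or s   # Common/Inherited join the run
        else:
            runs.append(cur)
            cur, cur_script = ch, s
    if cur:
        runs.append(cur)
    return runs

# The customization puts a word boundary at the Latin/Greek seam:
assert script_runs('aquaφοβία') == ['aqua', 'φοβία']
```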

Any suggestions for clarifying language are appreciated.

Mark


On Sat, Dec 9, 2017 at 3:28 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> Draft 1 of UAX#29 'Unicode Text Segmentation' for Unicode 11.0.0
> implies that it might be considered desirable to have a word boundary
> in 'aquaφοβία' or a grapheme cluster break in a coding such as <U+006C,
> U+0901 DEVANAGARI SIGN CANDRABINDU> for el candrabindu (l̐), which
> should be <U+006C, U+0310 COMBINING CANDRABINDU> in accordance with the
> principle of script separation.  Why are such breaks desirable?
>
> I can understand an argument that these should be tolerated, as an
> application could have been designed on the basis that script
> boundaries imply word boundaries (not true for Japanese) and that word
> boundaries imply grapheme cluster boundaries (not true for Sanskrit,
> where they don't even imply character boundaries.)  There are some who
> claim that the Laotian consonant place holder is the letter 'x' rather
> than the multiplication sign, U+00D7, which does have
> Indic_syllabic_category=Consonant_Placeholder. (I trust no-one is
> suggesting that there should be grapheme cluster boundary between
> U+00D7 with script=common and a non-spacing Lao vowel any more than
> there would be with a Lao consonant.)
>
> Richard.
>
>


Re: Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

2017-12-09 Thread Mark Davis ☕️ via Unicode
1. You make a good point about the GB9c. It should probably instead be
something like:

GB9c: (Virama | ZWJ )   × Extend* LinkingConsonant


Extend is broader than necessary, and there are a few items that have
ccc!=0 but not gcb=extend. But all of those look to be degenerate cases.

https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\p{ccc!=0}-\p{gcb=extend}]=ccc+indicsyllabiccategory
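A rough sketch of what the corrected rule matches, with tiny hard-coded stand-ins for the real property classes (the actual Virama, Extend, and LinkingConsonant sets come from the UCD; the Devanagari characters below are just one illustrative subset):

```python
import re

VIRAMA = '\u094D'            # DEVANAGARI SIGN VIRAMA
ZWJ = '\u200D'
EXTEND = '\u093C'            # DEVANAGARI SIGN NUKTA (gcb=Extend)
CONSONANTS = '\u0915-\u0939' # LinkingConsonant, approximated

# GB9c: (Virama | ZWJ) x Extend* LinkingConsonant (no break where matched).
gb9c = re.compile(f'[{VIRAMA}{ZWJ}][{EXTEND}]*[{CONSONANTS}]')

# KA + VIRAMA + NUKTA + KA: the Extend* term keeps the sequence together.
assert gb9c.search('\u0915\u094D\u093C\u0915') is not None
# Without Extend* (the earlier draft of the rule), the same text fails:
old = re.compile(f'[{VIRAMA}{ZWJ}][{CONSONANTS}]')
assert old.search('\u0915\u094D\u093C\u0915') is None
```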



Mark 

On Fri, Dec 8, 2017 at 11:06 PM, Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> Apart from the likely but unmandated consequence of making editing
> Indic text more difficult (possibly contrary to the UK's Equality Act
> 2010), there is another difficulty that will follow directly from the
> currently proposed expansion of grapheme clusters
> (https://www.unicode.org/reports/tr29/proposed.html).
>
> Unless I am missing something, text boundaries have hitherto been
> cunningly crafted so that they are not changed by normalisation.
> Have I missed something, or has there been a change in policy?
>
> For extended grapheme clusters, the relevant rules are proposed as:
>
> GB9: ×  (Extend | ZWJ | Virama)
>
> GB9c: (Virama | ZWJ )   × LinkingConsonant
>
> Most of the Indian scripts have both nukta (ccc=7) and virama (ccc=9).
> This would lead canonically equivalent text to have strikingly
> different divisions:
>
>  (no break)
>
> but
>
> 
>
> There are other variations on this theme. In Tai Tham, we have the
> following conflict:
>
> natural order, no break:
>
> 
>
> but normalised, there would be a break:
>
> 
>
> From reading the text, it seems that it is expected that the presence
> or absence of a break should be fine-tuned by CLDR language-specific
> rules.  How is this expected to work, e.g. for Saurashtra in Tamil
> script?  (There's no Saurashtra data in Version 32 of CLDR.)  Would the
> root locale now specify the default segmentation rule, rather than
> UAX#29 plus the Unicode Character Database?
>
> Richard.
>
>
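The normalization hazard Richard raises can be seen directly: canonical reordering sorts combining marks by ccc, so a virama (ccc=9) followed by a nukta (ccc=7) is rewritten with the nukta first. A quick check with Devanagari KA (the nukta form U+0958 is a composition exclusion, so NFC does not recompose it):

```python
import unicodedata

KA, VIRAMA, NUKTA = '\u0915', '\u094D', '\u093C'
assert unicodedata.combining(NUKTA) == 7   # ccc=7
assert unicodedata.combining(VIRAMA) == 9  # ccc=9

# <KA, VIRAMA, NUKTA> and <KA, NUKTA, VIRAMA> are canonically equivalent;
# normalization rewrites the first into the second. A rule matching
# "Virama before LinkingConsonant" literally would see different
# neighborhoods in the two equivalent spellings.
s = KA + VIRAMA + NUKTA
assert unicodedata.normalize('NFC', s) == KA + NUKTA + VIRAMA
```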


Re: ASCII v Unicode

2017-11-05 Thread Mark Davis ☕️ via Unicode
I had some time on the plane this weekend, and generated some more
comprehensive figures that take the following into account:

   1. There are two senses of "Unicode". In the narrow sense, it is only
   the Unicode Standard (ie, Unicode Characters). But it has grown to have a
   more comprehensive sense, including the other two main projects of the
   Unicode Consortium: Unicode CLDR and ICU.
   2. The ca. 3,300 pages that Asmus cited include specification *text*
   alone, but *data/code* (eg, UCD property data, or source code for ICU)
   is a vital part of the projects.


I thus generated a rough comparison where I (a) included CLDR and ICU, and
(b) included data. That gave the following results (where "encoding"
includes both the Unicode Standard *and* UTS's that are aligned with it in
version, including emoji — since that is to be aligned with it).

[image: Inline image 1]


*Caveats*

   - *This is a rough approximation (my flight wasn't all that long...).*
   In particular, don't count on the 3 decimals of precision — that is just
   the spreadsheet charting.
   - For the data files and code files, I filtered by removing # comments,
   collapsing sequences of whitespace into a single space character, trimming
   whitespace, and tossing empty lines. I then counted a page as a total of 3K
   code points. So the page count for data and code is far smaller than simply
   a line count. (Didn't bother dropping // and /*...*/ comments in code.) I
   also excluded .txt files that had the word "test" (case-insensitive) in
   their names.
   - For html pages I took a few samples of PDFs for UTS's and ICU docs,
   and got a count of HTML code points per page for each generated type of
   page, then divided out to get an approximate page count.
   - There were some other filters: for example, for ICU sources I included
   only files of type {"cpp", "c", "h", "ucm", "java"}, since files of type
   "txt" were likely generated from CLDR data. For CLDR I excluded charts and
   Survey Tool pages, since that would have bulked up the CLDR pie-slice
   dramatically.
   - (And by the way, the pie-slice for emoji is not visible in this graph:
   just 0.1%.)


Mark 

On Fri, Nov 3, 2017 at 2:36 AM, Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 11/3/2017 2:13 AM, Andre Schappo via Unicode wrote:
>
>
> You may find https://twitter.com/andreschappo/status/926163719331176450 
> amusing
> 
>
> André Schappo
>
> You're wildly off in your page count.
>
> The "book" part of Unicode (Core Specification) alone is 1,500 pages. I
> haven't looked at the single file code charts in a while, but I believe you
> get at least that number again. Then add the dozen or so "Annexes" for a
> few hundred additional pages and be happy that nobody prints the Unicode
> Character Database (or the Unihan Database for that matter).
>
> A./
>


Re: Interesting UTF-8 decoder

2017-10-09 Thread Mark Davis ☕️ via Unicode
The paper points out that the input buffer needs to be padded with 3 null
bytes as a precondition.

Mark 

On Mon, Oct 9, 2017 at 10:57 AM, J Decker via Unicode 
wrote:

> that's interesting; however it will segfault if the string ends on a
> memory allocation boundary.  will have to make sure strings are always
> allocated with 3 extra bytes.
>
> 2017-10-09 1:37 GMT-07:00 Martin J. Dürst via Unicode  >:
>
>> A friend of mine sent me a pointer to
>> http://nullprogram.com/blog/2017/10/06/, a branchless UTF-8 decoder.
>>
>> Regards,   Martin.
>>
>
>


Re: Unicode education in Schools

2017-08-25 Thread Mark Davis ☕️ via Unicode
Mark

(https://twitter.com/mark_e_davis)

On Thu, Aug 24, 2017 at 11:01 PM, Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 8/24/2017 10:17 AM, Andre Schappo via Unicode wrote:
>
> >> Because there are many systems that can now handle BMP characters but
> >> cannot handle SMP characters.
>>
>> One example being systems that use mysql utf8 (3 byte encoding) and have
>> not yet updated to utf8mb4 (4 byte encoding)
>>
>> So, I consider it important to familiarise students with SMP characters
>> as well as BMP characters. Then when they develop software they will, at
>> the start, be thinking beyond ASCII and Unicode BMP characters.
>>
>
> The thinking "beyond BMP" part only comes in when you work in encoding
> forms where the BMP uses a different number of code units than the SMP (or
> any other non-BMP "page"). This is true for both utf8 and utf16 but not if
> you work in utf32 or in scalar values (as in the posted exercise).
>
>
> The trick with using emoji in this lesson is that the descriptions and
> images are meaningful to any English speaker, so it gets the student to
> learn about character names.
>
> The same exercise would be more of a challenge for students whose native
> tongue is not English.


​> The trick with using emoji...

True. For emoji names it would be better to use the CLDR names with
non-anglophone audiences, since those names are available in a number of
languages.

eg http://www.unicode.org/cldr/charts/31/annotations/romance.html# (that
was last release's version; next release will have improvements...)
​

>
>
> A./
>
>
>> André Schappo
>>
>> On 24 Aug 2017, at 17:45, Shriramana Sharma  wrote:
>>>
>>> So how do you think it matters if the characters are in the BMP or SMP?
>>>
>>
>>
>>
>
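The "beyond BMP" code-unit point above is easy to check concretely (a small sketch; U+4E2D and U+1F600 are arbitrary example characters): an SMP character needs 4 bytes in UTF-8, which is exactly what MySQL's 3-byte "utf8" (utf8mb3) cannot store, and a surrogate pair in UTF-16.

```python
bmp, smp = '\u4E2D', '\U0001F600'   # a CJK ideograph vs. an SMP emoji

# UTF-8: BMP characters need at most 3 bytes; SMP characters need 4.
assert len(bmp.encode('utf-8')) == 3
assert len(smp.encode('utf-8')) == 4

# UTF-16: SMP characters take two code units (a surrogate pair).
assert len(bmp.encode('utf-16-le')) // 2 == 1
assert len(smp.encode('utf-16-le')) // 2 == 2
```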


Re: Version linking?

2017-08-17 Thread Mark Davis ☕️ via Unicode
>Intermediate versions can't add any new characters, but can add sequences
and properties, including "emojification" of existing characters.
E.g. E4.0 didn't reference any characters from U10.0. It did recognize
*sequences* of existing U9.0 characters.
E5.0 did have the emoji properties of some 10.0 characters a bit ahead of
time, but only after they were completely locked down.

Mark

(https://twitter.com/mark_e_davis)

On Thu, Aug 17, 2017 at 3:04 PM, Shriramana Sharma 
wrote:

> Thanks for your reply, but how can characters be used portably if they
> are not part of the published standard yet? Or is it that hereafter
> both Unicode Standard + Unicode Emoji Standard will be parallelly
> portable or something like that?
>
> --
> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
>


Re: Version linking?

2017-08-17 Thread Mark Davis ☕️ via Unicode
Emoji versions are (currently) on a somewhat faster schedule than Unicode:

U10.0 — E5.0, E6.0 (TBD)
U09.0 — E3.0, E4.0

Intermediate versions can't add any new characters, but can add sequences
and properties, including "emojification" of existing characters.

On Aug 17, 2017 03:57, "Shriramana Sharma via Unicode" 
wrote:

A propos http://blog.unicode.org/2017/08/unicode-emoji-60-initial-drafts-draft.html
I would like to know whether it is intended that Emoji version N will
always be targeted at Unicode version N + 5 and published in year N +
2012.

I did not find the question or answer at
http://unicode.org/faq/emoji_dingbats.html – hence asking here. I hope
I didn't miss something.

--
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-08-03 Thread Mark Davis ☕️ via Unicode
FYI, the UTC retracted the following.

*[151-C19] Consensus:* Modify the section on "Best Practices for Using
FFFD" in section "3.9 Encoding Forms" of TUS per the recommendation in
L2/17-168, for Unicode version 11.0.

Mark

(https://twitter.com/mark_e_davis)

On Wed, May 24, 2017 at 3:56 PM, Karl Williamson via Unicode <
unicode@unicode.org> wrote:

> On 05/24/2017 12:46 AM, Martin J. Dürst wrote:
>
>> On 2017/05/24 05:57, Karl Williamson via Unicode wrote:
>>
>>> On 05/23/2017 12:20 PM, Asmus Freytag (c) via Unicode wrote:
>>>
>>
>> Adding a "recommendation" this late in the game is just bad standards
 policy.

>>>
>>> Unless I misunderstand, you are missing the point.  There is already a
>>> recommendation listed in TUS,
>>>
>>
>> That's indeed correct.
>>
>>
>>> and that recommendation appears to have
>>> been added without much thought.
>>>
>>
>> That's wrong. There was a public review issue with various options and
>> with feedback, and the recommendation has been implemented and in use
>> widely (among else, in major programming language and browsers) without
>> problems for quite some time.
>>
>
> Could you supply a reference to the PRI and its feedback?
>
> The recommendation in TUS 5.2 is "Replace each maximal subpart of an
> ill-formed subsequence by a single U+FFFD."
>
> And I agree with that.  And I view an overlong sequence as a maximal
> ill-formed subsequence that should be replaced by a single FFFD. There's
> nothing in the text of 5.2 that immediately follows that recommendation
> that indicates to me that my view is incorrect.
>
> Perhaps my view is colored by the fact that I now maintain code that was
> written to parse UTF-8 back when overlongs were still considered legal
> input.  An overlong was a single unit.  When they became illegal, the code
> still considered them a single unit.
>
> I can understand how someone who comes along later could say C0 can't be
> followed by any continuation character that doesn't yield an overlong,
> therefore C0 is a maximal subsequence.
>
> But I assert that my interpretation is just as valid as that one.  And
> perhaps more so, because of historical precedent.
>
> It appears to me that little thought was given to the fact that these
> changes would cause overlongs to now be at least two units instead of one,
> making long existing code no longer be best practice.  You are effectively
> saying I'm wrong about this.  I thought I had been paying attention to
> PRI's since the 5.x series, and I don't remember anything about this.  If
> you have evidence to the contrary, please give it. However, I would have
> thought Markus would have dug any up and given it in his proposal.
>
>
>
>>
>> There is no proposal to add a
>>> recommendation "this late in the game".
>>>
>>
>> True. The proposal isn't for an addition, it's for a change. The "late in
>> the game" however, still applies.
>>
>> Regards,   Martin.
>>
>>
>
>


Re: Turtle Graphics Emoji

2017-07-28 Thread Mark Davis ☕️ via Unicode
Producing emoji sticker sets and apps requires no involvement of Unicode or
any other organization.

So you can find out on your own whether there is an audience for your
"Turtle Graphics Emoji".

Mark

(https://twitter.com/mark_e_davis)

On Fri, Jul 28, 2017 at 2:22 PM, William_J_G Overington via Unicode <
unicode@unicode.org> wrote:

> I have been thinking about having Turtle Graphics Emoji as an educational
> and fun idea.
>
> Turtle Graphics Emoji would each be for one turtle graphics command, such
> as forward, right and left and then there could be digits in a text message
> after the emoji character to act as the parameter to the turtle graphics
> command. There could also be a few associated emoji for start, pause and
> stop and for expressing loops.
>
> I am thinking that Turtle Graphics Emoji would be both educational and fun.
>
> William Overington
>
> Friday 28 July 2017
>
>
>
>


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-31 Thread Mark Davis ☕️ via Unicode
> I do not understand the energy being invested in a case that shouldn't
> happen, especially in a case that is a subset of all the other bad cases
> that could happen.

I think Richard stated the most compelling reason:

… The bug you mentioned arose from two different ways of
counting the string length in 'characters'.  Having two different
'character' counts for the same string is inviting trouble.


For implementations that emit FFFD while handling text conversion and
repair (ie, converting ill-formed UTF-8 to well-formed), it is best for
interoperability if they get the same results, so that indices within the
resulting strings are consistent across implementations for all the
*correct* characters thereafter.

It would be preferable *not* to have the following:

source = %c0%80abc

Vendor 1:
fixed = fix(source)
fixed == �abc
codepointAt(fixed, 3) == 'c'

Vendor 2:
fixed = fix(source)
fixed == ��abc
codepointAt(fixed, 3) == 'b'

In theory one could just throw an exception. In practice, nobody wants
their browser to belly up on a webpage with a component that has an
ill-formed bit of UTF-8.

In theory one could document everyone's flavor of the month for how many
FFFD's to emit. In practice, that falls apart immediately, since in today's
interconnected world you can't tell which processes get first crack at text
repair.
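For what it's worth, Python's built-in decoder follows the maximal-subpart practice under discussion, so the %c0%80abc example can be checked directly (other implementations may legitimately differ, which is exactly the interoperability concern raised here):

```python
# C0 can begin no valid UTF-8 sequence, so <C0 80> repairs to two
# replacement characters under the maximal-subpart practice, not one.
source = b'\xc0\x80abc'
fixed = source.decode('utf-8', errors='replace')
assert fixed == '\ufffd\ufffdabc'
assert fixed[3] == 'b'   # indices of the surviving characters shift
                         # if a different number of U+FFFDs is emitted
```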

Mark

On Wed, May 31, 2017 at 7:43 PM, Shawn Steele via Unicode <
unicode@unicode.org> wrote:

> > > In either case, the bad characters are garbage, so neither approach is
> > > "better" - except that one or the other may be more conducive to the
> > > requirements of the particular API/application.
>
> > There's a potential issue with input methods that indirectly edit the
> backing store.  For example,
> > GTK input methods (e.g. function gtk_im_context_delete_surrounding())
> can delete an amount
> > of text specified in characters, not storage units.  (Deletion by
> storage units is not available in this
> > interface.)  This might cause utter confusion or worse if the backing
> store starts out corrupt.
> > A corrupt backing store is normally manually correctable if most of the
> text is ASCII.
>
> I think that's sort of what I said: some approaches might work better for
> some systems and another approach might work better for another system.
> This also presupposes a corrupt store.
>
> It is unclear to me what the expected behavior would be for this
> corruption if, for example, there were merely a half dozen 0x80 in the
> middle of ASCII text?  Is that garbage a single "character"?  Perhaps
> because it's a consecutive string of bad bytes?  Or should it be 6
> characters since they're nonsense?  Or maybe 2 characters because the
> maximum # of trail bytes we can have is 3?
>
> What if it were 2 consecutive 2-byte sequence lead bytes and no trail
> bytes?
>
> I can see how different implementations might be able to come up with
> "rules" that would help them navigate (or clean up) those minefields,
> however it is not at all clear to me that there is a "best practice" for
> those situations.
>
> There also appears to be a special weight given to non-minimally-encoded
> sequences.  It would seem to me that none of these illegal sequences should
> appear in practice, so we have either:
>
> * A bad encoder spewing out garbage (overlong sequences)
> * Flipped bit(s) due to storage/transmission/whatever errors
> * Lost byte(s) due to storage/transmission/coding/whatever errors
> * Extra byte(s) due to whatever errors
> * Bad string manipulation breaking/concatenating in the middle of
> sequences, causing garbage (perhaps one of the above 2 coding errors).
>
> Only in the first case, of a bad encoder, are the overlong sequences
> actually "real".  And that shouldn't happen (it's a bad encoder after
> all).  The other scenarios seem just as likely, (or, IMO, much more likely)
> than a badly designed encoder creating overlong sequences that appear to
> fit the UTF-8 pattern but aren't actually UTF-8.
>
> The other cases are going to cause byte patterns that are less "obvious"
> about how they should be navigated for various applications.
>
> I do not understand the energy being invested in a case that shouldn't
> happen, especially in a case that is a subset of all the other bad cases
> that could happen.
>
> -Shawn
>
>


Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-21 Thread Mark Davis ☕️ via Unicode
I actually didn't see any of this discussion until today. (
unicode@unicode.org mail was going into my spam folder...) I started
reading the thread, but it looks like a lot of it is OT, so just scanned
some of them.

A few brief points:

   1. There is plenty of time for public comment, since it was
targeted at *Unicode
   11*, the release for about a year from now, *not* *Unicode 10*, due this
   year.
   2. When the UTC "approves a change", that change is subject to comment,
   and the UTC can always reverse or modify its approval up until the meeting
   before release date. *So there are ca. 9 months in which to comment.*
   3. The modified text is a set of guidelines, not requirements. So no
   conformance clause is being changed.
   - If people really believed that the guidelines in that section should
  have been conformance clauses, they should have proposed that at
some point.
   - And still can propose that — as I said, there is plenty of time.


Mark

On Wed, May 17, 2017 at 10:41 PM, Doug Ewell via Unicode <
unicode@unicode.org> wrote:

> Henri Sivonen wrote:
>
> > I find it shocking that the Unicode Consortium would change a
> > widely-implemented part of the standard (regardless of whether Unicode
> > itself officially designates it as a requirement or suggestion) on
> > such flimsy grounds.
> >
> > I'd like to register my feedback that I believe changing the best
> > practices is wrong.
>
> Perhaps surprisingly, it's already too late. UTC approved this change
> the day after the proposal was written.
>
> http://www.unicode.org/L2/L2017/17103.htm#151-C19
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
>


Re: Standardized variation sequences for the Deseret alphabet?

2017-04-06 Thread Mark Davis ☕️
Mark

On Thu, Apr 6, 2017 at 6:11 PM, Michael Everson <ever...@evertype.com>
wrote:

> On 6 Apr 2017, at 16:05, Mark Davis ☕️ <m...@macchiato.com> wrote:
>
> >> I just get frustrated when everyone including the veterans seems to
> >> forget every bit of precedent that we have for the useful encoding of
> >> characters.
> >
> > ​Nobody's forgetting anything. ​Simply because people disagree with you
> doesn't mean they are forgetful or stupid. One could just as well respond
> that you are forgetting that Unicode is not a glyph standard. Merely
> because a character has multiple shapes is not grounds for disunifying it.
>
> The ignoring of reasonable precedent does not make the UTC seem
> reasonable. In terms of Deseret, the suggestion that characters Ѕ/Ћ/Ѓ/Љ
> with a stroke derived from І are glyph variants of one another simply
> makes no sense at all. We have honed over many years our understanding of
> writing systems, and saying “Oh, Љ-with-stroke and Ѓ-with stroke are
> variant shapes of the same thing”… Anyone can see that this is not true.
>

​"Anyone" doesn't matter. What matters is users of Deseret, not you, not
me. If knowledgeable users of Deseret recognize two shapes as representing
the same character, that is what matters. Similarly, users of Fraktur will
recognize that *very* different shapes represent the same Latin character,
while some very similar (to other's eyes) shapes represent different
characters (some of the capitals, for example).
​

>
> The vexing thing is that one can never rely on consistency in the UTC’s
> approaches to any proposal. I have discussed this with other successful and
> prolific proposal writers. It’s always a coin-toss as to how a proposal
> will be viewed.
>
> The recent instance of adding attested capital letters for ʂ and ʐ is a
> perfect example. We have seen before some desire to see evidence for casing
> pairs (though often it has not been sought.) We have never before seen
> evidence for casing pairs to be thrown out. Case, of course, is a function
> of the Latin script, just as it is of Greek and Cyrillic and Armenian and
> Cherokee and both Georgian scripts and others. The UTC’s refusal to encode
> attested capitals for ʂ and ʐ simply makes no sense.
>

To you.

>
> Your statement "Merely because a character have multiple shapes is not
> grounds for disunifying it” suggests an underlying view that "everything is
> already encoded and additions are disunifications”.


No, not at all. That is a false dichotomy.


> I do not subscribe to this view.
>



>
> Michael Everson


Re: Standardized variation sequences for the Deseret alphabet?

2017-04-06 Thread Mark Davis ☕️
On Thu, Apr 6, 2017 at 4:07 PM, Michael Everson wrote:

> I just get frustrated when everyone including the veterans seems to forget
> every bit of precedent that we have for the useful encoding of characters.
>

Nobody's forgetting anything. Simply because people disagree with you
doesn't mean they are forgetful or stupid. One could just as well respond
that you are forgetting that Unicode is *not* a glyph standard. Merely
because a character has multiple shapes is not grounds for disunifying it.

Mark


Re: Proposal to add standardized variation sequences for chess notation

2017-04-04 Thread Mark Davis ☕️
Amusing as this is, it's hard to believe that people are spending this much
time on an April Fool's posting.

I'm looking forward to similar postings on checkers and go pieces. As a
matter of fact, one that proposes adding new characters for every possible
configuration of a go board would be imaginative.

And I'm looking also forward to the ♖+ZWJ+⬛️  (etc) proposal.

Mark

On Tue, Apr 4, 2017 at 3:00 PM, Philippe Verdy  wrote:

>
>
> 2017-04-04 1:30 GMT+02:00 Michael Everson :
>
>> On 3 Apr 2017, at 23:07, Asmus Freytag (c)  wrote:
>> >
>> > On 4/3/2017 2:15 PM, Michael Everson wrote:
>> >> On 3 Apr 2017, at 17:16, Asmus Freytag  wrote:
>> >>
>> > The same indirection is at play here.
>> >
>>  This is pure rhetoric, Asmus. It addresses the problem in no way.
>> 
>> >>> Actually it does. I'm amazed that you don't see the connection.
>> >>>
>> >> I’ve never understood you when you back up into that particular kind
>> of abstract rhetoric.
>> >
>> > Sometimes thinking through something in abstract terms actually
>> clarifies the situation.
>>
>> Of course I know that’s your view. It’s just never been an effective
>> communication strategy between you and me generally.
>>
>> >>> The “meaning” of a chess-problem matrix is the whole 8 × 8 board, not
>> the empty dark square at b4 or the white pawn on
>> >
>> > In other words, you assert that partial boards never need to be
>> displayed. (Let's take that as read, then).
>>
>> No, I am sure that a variety of board shapes can be set in plain text
>> with these conventions, though the principle concern is classical chess
>> notation.
>>
>> >> The “problem” the higher-level protocol is supposed to solve is the
>> one where a chess piece of one colour sits in an em-squared zone whether
>> light or dark. In lead type this was a glyph issue. Lead type had just
>> exactly what my proposal has: A piece with in-line text metrics, spaced
>> harmoniously with digits and letters, and square sorts with and without
>> hatching.
>> >
>> > Leaving aside the abstract question whether modeling lead type is ipso
>> facto the best solution in all cases…
>>
>> I think it was a good expedient solution in lead type and that this
>> proposal offers a robust parseable digital version of that solution, and I
>> assert people will make use of that data structure.
>>
>> >> OK, then you support the part of the proposal that applies VS1 and VS2
>> to the chess pieces.
>> >
>> > My statement just was that a proposal where piece + VS should be
>> M-square, piece w/o VS should be generic, might make some sense (and same
>> for a suitable "empty" cell).
>> >
>> > The next question would be whether the alternation in background is
>> best expressed in variation sequences or by some other means.
>>
>> I think the value in the data structures I have described is best
>> retained as text. Anything else just seems it would be simply needlessly
>> complex,
>>
>> > If you never need to show just a single field, then I concede that the
>> main drawback of variation selectors for the background style is absent;
>> however, reading ahead in your message, the partial grid appears to be
>> common, therefore the reason to choose an alternate solution to the
>> background style is a strong one.
>>
>> Well, it’s text, Asmus, so you can delete all but one line of a board if
>> you want:
>>
>> ▕▨︁□︀▨︁□︀▨︁♘︀▨︁□︀▏
>>
>> There. So… what are you talking about? It’s a text matrix. It’s like a
>> kind of poem.
>>
>> ▗▖
>> ▕□︀▨︁□︀▨︁□︀▨︁♞︀▨︁▏
>> ▕▨︁□︀▨︁□︀▨︁□︀▨︁□︀▏
>> ▕□︀▨︁♔︀▨︁□︀▨︁□︀▨︁▏
>> ▕▨︁□︀▨︁□︀▨︁♘︀▨︁□︀▏
>> ▕□︀▨︁□︀▨︁♚︀▨︁□︀▨︁▏
>> ▕▨︁□︀▨︁□︀▨︁□︀▨︁□︀▏
>> ▕□︀▨︁□︀♙︁♛︀▨︁□︀▨︁▏
>> ▕▨︁□︀♕︁□︀▨︁♖︀▨︁□︀▏
>> ▝▘
>>
>> It even looks like one. That’s a meaningful pattern. A kind of writing
>> system.
>>
>
> For me it looks like ASCII art, a hack mixing various characters intended
> for different uses and ignoring all semantics, only working because it
> reuses similar-looking glyphs instead of being an actual encoding.
> That representation is absolutely not semantically coherent.
>
> If we want to have true checkerboard cells, we need characters specifically
> for them, and in them we'll place (or not) chess pieces or any other
> suitable symbol or letter. This means creating clusters (cell+ZWJ+piece).
> This will be coherent.
>
> If we want to have borders for boards, we need coherent characters for
> them (we do not expect them to be combined with pieces, just that they will
> properly glue with cells in the middle of the board, and that their metrics
> match them in suitable fonts).
>
> The fact that legacy renderers or fonts won't display that correctly is
> definitely not an argument. Many scripts still have problems being
> represented with legacy renderers or fonts. But the encoding is made to be
> coherent semantically. Fonts and renderers will adapt their properties to
> render what is semantically wanted and that 
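
The two rival conventions in this thread can be sketched in a few lines. This is only an illustrative sketch: the variation-selector assignments (VS1 = light square, VS2 = dark square) follow the proposal under discussion, and the "cell" character below is a stand-in, since no dedicated checkerboard-cell characters are encoded.

```python
# Sketch of the two conventions debated above (assumptions, not encoded
# practice): piece + variation selector (the proposal's model) versus a
# hypothetical cell + ZWJ + piece cluster (Verdy's model).
VS1, VS2 = "\uFE00", "\uFE01"   # VARIATION SELECTOR-1 / -2
ZWJ = "\u200D"                  # ZERO WIDTH JOINER
KNIGHT = "\u2658"               # WHITE CHESS KNIGHT

# Proposal-style: the variation selector selects the square shading.
knight_on_dark = KNIGHT + VS2

# Cluster-style: a stand-in cell character joined to the piece with ZWJ.
DARK_CELL = "\u25A8"            # SQUARE WITH UPPER RIGHT TO LOWER LEFT FILL
knight_cluster = DARK_CELL + ZWJ + KNIGHT

print([f"U+{ord(c):04X}" for c in knight_on_dark])   # ['U+2658', 'U+FE01']
print([f"U+{ord(c):04X}" for c in knight_cluster])   # ['U+25A8', 'U+200D', 'U+2658']
```

Either way, a renderer (or font, via ligation) has to recognize the sequence as one cluster; the disagreement above is about which sequence carries the cleaner semantics.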

Re: Unicode Emoji 5.0 characters now final

2017-03-31 Thread Mark Davis ☕️
Ken's observation "…approximately backwards…" is exactly right, and that's
the same reason why Markus suggested something along the lines of
"interoperable".

I don't think we've come up with a pithy category name yet, but I tried
different wording on the slides on http://unicode.org/emoji/. See what you
think, Doug.

Mark

On Thu, Mar 30, 2017 at 4:58 PM, Doug Ewell  wrote:

> Asmus Freytag wrote:
>
> > Recommending to vendors to support a minimal set is one thing.
> > Recommending to users to only use sequences from that set / or vendors
> > to not extend coverage beyond the minimum is something else. Both use
> > the word "recommendation" but the flavor is rather different (which
> > becomes more obvious when you re-phrase as I suggested).
> >
> > That seems to be the source of the disconnect.
>
> That seems a fair analysis.
>
> Another way of putting this is that marking a particular subset of valid
> sequences as "recommended" is one thing, while listing sequences in a
> table with a column "Standard sequence?", with some sequences marked
> "Yes" and others marked "No," is something else.
>
> Equivalently, characterizing a group of valid sequences as "Valid, but
> not recommended" is something else.
>
> If the goal is to tell users that three of the sequences are especially
> likely to be supported, or to tell vendors that they should prioritize
> support for these three, then "recommended" and "additional," used as a
> pair, would be more appropriate.
>
> If the goal is to tell users "we don't want you to use the other 5100
> sequences" and to tell vendors "we don't want you to offer support for
> them," then the existing wording is fine.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>


Re: Unicode Emoji 5.0 characters now final

2017-03-30 Thread Mark Davis ☕️
> `150` in UN M.49 which ISO 3166-1 was derived from and is compatible
> with. CLDR could safely adopt that if needed.

No need to "safely adopt". It is already valid:

http://www.unicode.org/reports/tr51/proposed.html#flag-emoji-tag-sequences

If you follow the links you'll end up at

http://unicode.org/repos/cldr/trunk/common/validity/region.xml

And find that 150 is already valid. (For the format of that file, see LDML.)



Where people have looked at the documentation and their questions are still
not answered, that feedback is useful so that the documentation can be
improved. But it appears that at least some people haven't bothered to do
that, when it could answer a lot of the questions/complaints on this list.

Mark

On Thu, Mar 30, 2017 at 11:48 AM, Christoph Päper <
christoph.pae...@crissov.de> wrote:

> Philippe Verdy  hat am 30. März 2017 um 00:40
> geschrieben:
>
> > There's no ISO 3166-1 code for Europe as a whole (does it exist
> > legally if we can't clearly define its borders?)
>
> `150` in UN M.49 which ISO 3166-1 was derived from and is compatible
> with. CLDR could safely adopt that if needed.
>
> No alpha-2 and hence no RIS sequence, though. An Emoji Tag Sequence would
> be straight-forward, though: U+1F3F4-E0031-E0035-E0030-E007F.
>
>
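
The mechanics of the two flag mechanisms mentioned in this exchange can be sketched as follows. The helper names are ours, not from any library, and whether a given sequence (such as the one for `150`) is actually supported or RGI is the separate validity question discussed above.

```python
# Sketch of the two flag mechanisms: regional indicator symbol (RIS) pairs
# for alpha-2 codes, and emoji tag sequences (per UTS #51) for other codes
# such as UN M.49 "150". Helper names are illustrative assumptions.
TAG_BASE = 0x1F3F4    # U+1F3F4 WAVING BLACK FLAG
CANCEL_TAG = 0xE007F  # U+E007F CANCEL TAG
TAG_ZERO = 0xE0000    # tag characters shadow ASCII at U+E00xx

def ris_flag(alpha2: str) -> str:
    """Regional-indicator pair for an alpha-2 code, e.g. 'FR'."""
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in alpha2.upper())

def tag_flag(code: str) -> str:
    """Emoji tag sequence: black flag + tag characters + cancel tag."""
    return (chr(TAG_BASE)
            + "".join(chr(TAG_ZERO + ord(c)) for c in code.lower())
            + chr(CANCEL_TAG))

europe = tag_flag("150")
print([f"U+{ord(c):04X}" for c in europe])
# ['U+1F3F4', 'U+E0031', 'U+E0035', 'U+E0030', 'U+E007F']
```

This reproduces the U+1F3F4-E0031-E0035-E0030-E007F sequence quoted above; constructing the sequence says nothing about whether any vendor's font will render a flag for it.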


Re: Unicode Emoji 5.0 characters now final

2017-03-30 Thread Mark Davis ☕️
> If I made an open-source emoji font that contained flags for all of the
> 5000ish ISO 3166-2 codes that actually map to one, would I automatically
> be considered a vendor?
>

> Do I need to pay 18000(?) dollars a year for full membership first?
> (That's peanuts for multi-billion dollar companies, but unaffordable for
> most individuals and many FOSS projects.)

The answer to both of your questions is no.

Please see http://unicode.org/emoji/selection.html#timeline for details.
What the UTC is looking for is commitments from major vendors. It is not
sufficient to join Unicode: we have members who are not major vendors of
emoji. And there are some major vendors that are not members.

Of course, there is some judgment involved as to what constitutes "major":
at one extreme clearly 1B DAUs qualifies, and at the other extreme, 1K
doesn't.

Mark

