Re: Tag characters

2015-05-18 Thread Mark Davis ☕️
​A few notes.

A more concrete proposal will be in a PRI to be issued soon, and people
will have a chance to comment more then. (I'm not trying to discourage
discussion, just pointing out that there will be something more concrete
relatively soon to comment on—people are pretty busy getting 8.0 out the
door right now.)

The principal reason for 3-digit codes is that they are the mechanism
used by BCP47 in case ISO screws up codes (as they did for CS).

The syntax does not need to follow the 3166 syntax - the codes correspond
but are not the same anyway. So we didn't see the necessity for the hyphen,
syntactically.

There is a difference between EU and UN; the former is in BCP47. That being
said, we could look at making the exceptionally reserved codes valid for
this purpose (or at least the UN code). It appears that there are only 3
exceptionally reserved codes that aren't in BCP47: EZ, UK, UN.

Just because a code is valid doesn't mean that there is a flag associated
with it. Just like the fact that you can have the BCP47 code ja-Ahom-AQ
doesn't mean that it denotes anything useful. I'd expect vendors to not
waste time with non-existent flags. However, we could also discuss having a
mechanism in CLDR to help provide guidelines as to which subdivisions are
suitable as flags.





Mark 

*— Il meglio è l’inimico del bene —*

On Sat, May 16, 2015 at 10:07 AM, Doug Ewell  wrote:

> L2/15-145R says:
>
>  On some platforms that support a number of emoji flags, there is
>> substantial demand to support additional flags for the following:
>> [...]
>> Certain supra-national regions, such as Europe (European Union flag)
>> or the world (e.g. United Nations flag). These can be represented
>> using UN M49 3-digit codes, for example "150" for Europe or "001" for
>> World.
>>
>
> These are uncomfortable equivalence classes. Not all countries in Europe
> are members of the European Union, and the concept of "United Nations" is
> not really the same by definition as "all countries in the world."
>
> The remaining UN M.49 code elements that don't have a 3166-1 equivalent
> seem wholly unsuited for this mechanism (and those that do, don't need it).
> There are no flags for "Middle Africa" or "Latin America and the Caribbean"
> or "Landlocked developing countries."
>
> Some trans-national organizations might _almost_ seem as if they could be
> shoehorned into an M.49 code element, like identifying 035 "South-Eastern
> Asia" with the ASEAN flag, but this would be problematic for the same
> reasons as 150 and 001.
>
> Among the ISO 3166-1 "exceptionally reserved" code elements are "EU" for
> "European Union" and "UN" for "United Nations." If these flags are the use
> cases, why not simply use those alpha-2 code elements, instead of burdening
> the new mechanism with the 3-digit syntax?
>
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>


Re: Regexes, Canonical Equivalence and Backtracking of Input

2015-05-18 Thread Richard Wordingham
On Tue, 19 May 2015 01:25:54 +0200
Philippe Verdy  wrote:

> I don't work with strings, but with what you seem to call "traces",

For the concept of traces, Wikipedia suffices:
https://fr.wikipedia.org/wiki/Mono%C3%AFde_des_traces .

As far as text manipulation is concerned, the word 'trace' is an
idealisation of how Latin text is written.  Base letters advance the
writing point, so they commute with nothing - canonical combining class
0. Ideally, marks of different canonical combining classes do not
interact with one another when writing, so they commute.  In general,
marks of the same canonical combining class interact with one another,
be it only to move the subsequent one further from the base letter, so
they do not commute.

The traces I refer to are the equivalence classes of Unicode modulo
canonical equivalence.  To apply the theory, I have to regard
decomposable characters as notations for sequences of 1 to 4
indecomposable characters.  The notion works with compatibility
equivalence, and one could use a stronger notion of equivalence, so
that compatibility ideographs did not have singleton decompositions.

Thus, as strings, \u0323\u0302 and \u0302\u0323 are distinct, but as
traces, they are identical.

The lexicographic normal form that is most useful is simply NFD.  The
indecomposable characters are ordered by canonical combining class and
then it doesn't matter; one may as well use codepoint.
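A minimal sketch of the distinction between strings and traces, using only
Python's standard unicodedata module. The helper name same_trace is made up
for this illustration; it simply compares NFD forms, which is one way (not
necessarily the most efficient) to test canonical equivalence.

```python
import unicodedata

def same_trace(x, y):
    """Compare two strings modulo canonical equivalence via their NFD forms."""
    return unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)

# U+0323 (ccc 220) and U+0302 (ccc 230) have different combining classes,
# so they commute: both orders denote the same trace.
below_then_above = "a\u0323\u0302"
above_then_below = "a\u0302\u0323"

# U+0301 and U+0302 both have ccc 230, so they do not commute.
acute_then_circumflex = "a\u0301\u0302"
circumflex_then_acute = "a\u0302\u0301"

print(below_then_above == above_then_below)                      # compared as strings
print(same_trace(below_then_above, above_then_below))            # compared as traces
print(same_trace(acute_then_circumflex, circumflex_then_acute))  # same ccc
```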

> but that I call sets of states (they are in fact bitsets, which may be
> compacted or just stored as arrays of bytes containing just 1 useful
> bit, but which may be a bit faster; byte arrays are just simpler to
> program), in a stack (I'll use bitsets later to make the structure
> more compact, if needed, but for now this is fast enough and not
> memory intensive even for large regexps with many repetitions with
> "+/*/{m,n}" or variable parts).

Your 'bitset' sounds like a general purpose type, and to be an
implementation detail that surfaces in your discussion.

> The internal matcher uses NFD, but
> needs to track the positions in the original buffered input for
> returning captured matches.

That's how I'm working.  I do not regard decomposable characters as
atomic; I am emotionally happy with working with fractions of
characters.

> ... Greek, Cyrillic and Arabic, but also too few for Hebrew where
> "pathological" cases of regexps are certainly more likely to occur
> than in Latin, even with Vietnamese and its frequent double
> diacritics).

I was just thinking respecting canonical equivalence might be very
useful for Hebrew, particularly when dealing with text with accents.

> Finally a question:
> 
> I suppose that like many programmers you have read the famous "green
> dragon" book by Sethi/Aho/Ullman about compilers. I can
> understand the terminology they use when speaking about automata
> (and that is found in many other places), but apparently you are
> using some terms that I have to guess from their context.

No, I started off by hunting the web to try and work out what was
special about a regular expression, and found the articles in
Wikipedia quite helpful.  When working out how to make matching
respect canonical equivalence, I started out with normalising strings
to NFD.  Only after I had generalised the closure properties of
regular languages from strings to these representative forms (with the
exception of Kleene star) did I finally discover what I had long
suspected, that I was not the first person to investigate regular
expressions on non-free monoids.  What does surprise me is that I
cannot find any evidence that anyone else has made the connection
between trace monoids and Unicode strings under canonical equivalence.
I would like to update the article on the trace monoid with its most
important example, Unicode strings under canonical equivalence, but,
alas, that seems to be 'original research'!

I'm beginning to think that 'letting the regex choose the input
character' might be a better method of dealing with interleaving of
subexpressions even for 'non-deterministic' engines, i.e. those which
follow all possible paths in parallel.  I'll have to compare the
relevant complexities.

> Good books on the subject are now becoming difficult to find (or
> they are more expensive now), and too difficult to use on the web
> (for such very technical topics, it really helps to have a printed
> copy that you can annotate, explore, or have beside you instead of on
> a screen, and printing ebooks is not an option if they are
> voluminous). Maybe you have other books to recommend.

Google Books, in English, gives access to a very helpful chapter on
regular languages in trace monoids in 'the Book of Traces'.

I found Russ Cox's Internet notes on regular expressions helpful, though
not everyone agrees with his love of non-determinism.

Richard.


Re: [OT] RE: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Philippe Verdy
2015-05-19 0:50 GMT+02:00 Doug Ewell :

> > but for now we just have "i-", deprecated but still valid,
>
> "i-" is not deprecated.


In the IANA database they are all replaced. I call that "deprecated" a bit
abusively, but there's no longer any interest in them.

>> for all other letters there's no parsing defined for now, their syntax
>> is unknown and they are not interchangeable without a standard, so
>> they are used only for private use

> Extension 't' was defined in 2011 and 'u' in 2010. They have
> well-defined syntax, specified in RFC 6497 and 6067 respectively.

You are speaking of extension subtags after the initial subtag; I did not
discuss them.

I was just speaking about the initial subtag (before the first hyphen),
where "t" and "u" are not defined: only "x" and "i" are defined there ("i"
is not defined in the other singletons for trailing subtags).

> Undefined singletons may not be used for private use.

For private use (meaning NOT for interchanges) NOTHING is forbidden, you
are never bound to any standard. There are lots of places where these
private extensions are used and not discussed.

>> some BCP47 use an empty first subtag, i.e. the tag starts by an
>> hyphen;

> Absolutely, utterly false.

Absolutely, utterly true, but a word was missing in my sentence: it should
have read "some BCP 47 extensions" (which are private, local only to a
specific piece of software in its internal data).


Re: Regexes, Canonical Equivalence and Backtracking of Input

2015-05-18 Thread Philippe Verdy
I don't work with strings, but with what you seem to call "traces", but
that I call sets of states (they are in fact bitsets, which may be
compacted or just stored as arrays of bytes containing just 1 useful bit,
but which may be a bit faster; byte arrays are just simpler to program),
in a stack (I'll use bitsets later to make the structure more compact, if
needed, but for now this is fast enough and not memory intensive even for
large regexps with many repetitions with "+/*/{m,n}" or variable parts).
The internal matcher uses NFD, but needs to track the positions in the
original buffered input for returning captured matches.

There's some optimization to reduce the size of the bitsets, by defining
classes. The representation of classes in Unicode is more challenging than
with plain ASCII or ISO8859-*; for this reason I limit their length (the
difference between the smallest and highest code point), and over this
size the classes are just defined as a sorted string of pairs of
code points: I can perform a binary search in that string and look at the
position (with an even position the character is part of the class; with an
odd position, the character is not part of it).
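The sorted-pairs lookup described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the actual
implementation: the names build_boundaries and in_class are hypothetical,
and the pairs are stored here as half-open boundaries (low, high + 1) so
that "the last boundary at or below the code point has an even index" means
"inside the class".

```python
from bisect import bisect_right

def build_boundaries(ranges):
    """Flatten sorted, non-overlapping inclusive ranges into [lo, hi + 1, ...]."""
    boundaries = []
    for low, high in ranges:
        boundaries.extend((low, high + 1))
    return boundaries

def in_class(boundaries, code_point):
    """True when the last boundary <= code_point sits at an even index."""
    position = bisect_right(boundaries, code_point) - 1
    return position >= 0 and position % 2 == 0

# Example class: ASCII letters A-Z and a-z.
LETTERS = build_boundaries([(0x41, 0x5A), (0x61, 0x7A)])
print(in_class(LETTERS, ord("Q")), in_class(LETTERS, ord("[")))
```

The binary search keeps lookups logarithmic in the number of ranges, which
matters once a class is too sparse to store as a bitset.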

Thanks to a previous message you posted, I noted that my code does not work
correctly with Hangul precomposed syllables: I perform the decomposition
to NFD of the input on the fly in the input buffer, but the buffer is
incorrectly advanced when there's a match to the next character, and it can
skip one or two characters of the original input instead of code points in
the NFD-transformed input. (I don't have extensive cases for testing
Hangul; I have much more for Latin, Greek, Cyrillic and Arabic, but also
too few for Hebrew where "pathological" cases of regexps are certainly more
likely to occur than in Latin, even with Vietnamese and its frequent double
diacritics.)
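One way to avoid advancing the buffer by the wrong amount is to carry the
original position alongside every decomposed code point. The following is a
sketch only, not the code discussed above: nfd_with_offsets is a
hypothetical helper built on Python's standard unicodedata module. It
decomposes each input character and then applies canonical reordering to
runs of non-starters, so the result agrees with NFD of the whole string.

```python
import unicodedata

def nfd_with_offsets(text):
    """Return [(code point, index in the original text), ...] in NFD order."""
    result, run = [], []

    def flush():
        # Canonical reordering: stable sort of a run of non-starters by ccc.
        run.sort(key=lambda pair: unicodedata.combining(pair[0]))
        result.extend(run)
        run.clear()

    for index, char in enumerate(text):
        for piece in unicodedata.normalize("NFD", char):
            if unicodedata.combining(piece) == 0:
                flush()
                result.append((piece, index))
            else:
                run.append((piece, index))
    flush()
    return result

# U+D55C is one precomposed Hangul syllable; NFD turns it into three jamo,
# all of which still point back to original index 1.
pairs = nfd_with_offsets("a\uD55Cb")
print([(hex(ord(piece)), index) for piece, index in pairs])
```

A match that ends on the last jamo can then report the original index of
the syllable rather than skipping ahead by the number of decomposed code
points.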

For now with the complex cases of replacements, I have no precise syntax
defined for specifying replacements as a simple string with placeholders.
I just allow these matches to be passed as objects (rather than just
strings) to a callback that performs the substitutions itself using the
array of captures given by the engine to the callback; I have no idea for
now about how to handle the special cases occurring when computing the
actual replacements:

The callback can insert/delete subsequences everywhere in the input buffer,
which is limited in size by the extent of $0, plus any intermediate
characters when there's a discontinuity, plus their left and right contexts
when the match still does not include the full combining sequences. (For
most use cases, the left context is empty, but the right context is
frequently non-empty and contains all combining characters over the last
base which is part of the match.) The callback also does not necessarily
have to modify the input buffer if it does not want to perform replacements
in it, but in that case the input buffer is read-only and I don't need to
feed the contexts, which remain empty. There are also left and right context
variables for *each* capture group (some of them may be partly or fully in
another returned capture group).

Finally a question:

I suppose that like many programmers you have read the famous "green
dragon" book by Sethi/Aho/Ullman about compilers. I can understand
the terminology they use when speaking about automata (and that is found
in many other places), but apparently you are using some terms that I have
to guess from their context.
Good books on the subject are now becoming difficult to find (or they are
more expensive now), and too difficult to use on the web (for such very
technical topics, it really helps to have a printed copy that you can
annotate, explore, or have beside you instead of on a screen, and printing
ebooks is not an option if they are voluminous). Maybe you have other
books to recommend. But finding these books in libraries is now becoming
difficult when many are closing or reducing their collections (and
I don't like buying books on the Internet). For the rest, I tend to just
describe what I've made or used or experimented with, even if the terms are
not the best ones (some of my references are in French, and difficult to
translate).

On difficult topics like this one, I'm not paid to perform research and I
can only do that in my spare time from time to time, until I can make
something stable enough for a limited use (without experimental features).
In the past I could work on such research topics, but now we are pressed to
use existing libraries and not spend a lot of time; we sell smaller
incremental but limited improvements and we know what is voluntarily limited
and left unimplemented.


2015-05-18 23:14 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> On Mon, 18 May 2015 22:56:47 +0200
> Philippe Verdy  wrote:
>
> > Isn't it possible for your basic substitution to transform \u0F73
> > into a character class [\u0F71\u0F72\u0F73] that the regexp consider

[OT] RE: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Doug Ewell
This is why I knew I would regret it.

Clearing up some errors here. No more posts from me on this non-Unicode
topic after this one.

Philippe Verdy  wrote:

>> This would be a major revision to BCP 47, it would have nothing to do
>> with reordering,
>
> It would have to do because all subtags after the primary language
> subtag in BCP47 are optional, and you can distinguish them only by
> their length *or* by the role assigned to specific singletons: there's
> already the "x" singleton exception (that is ordered at end), but
> other singletons are currently described to use a canonical order but
> it is used only for encoding variants unrelated to region subtags or
> even to the languages.

All non-initial singletons introduce an extension, except for 'x' which
introduces a private-use sequence, and which must be last.

Even if an extension were defined to hold top-level region information,
WHICH WILL NEVER HAPPEN, it would not matter whether that extension
appeared before or after other extensions, because it would be an
extension and not a region subtag.
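To make the role of non-initial singletons concrete, here is a rough sketch
of how a well-formed tag could be split. It is an illustration only, not a
validator: split_tag is a hypothetical helper, the sample tag is invented,
and the initial subtag is deliberately not interpreted.

```python
def split_tag(tag):
    """Split a language tag into (base subtags, extensions, private use)."""
    subtags = tag.lower().split("-")
    base, extensions, private_use = [subtags[0]], {}, []
    current = None  # subtag list of the extension currently being read
    for position, subtag in enumerate(subtags[1:], start=1):
        if subtag == "x":
            # 'x' introduces the private-use sequence, which runs to the end.
            private_use = subtags[position + 1:]
            break
        if len(subtag) == 1:
            # Any other non-initial singleton introduces an extension.
            current = extensions.setdefault(subtag, [])
        elif current is not None:
            current.append(subtag)
        else:
            base.append(subtag)
    return base, extensions, private_use

print(split_tag("en-US-u-ca-gregory-t-hi-x-foo-bar"))
```

Because extensions are keyed by their singleton, their relative order in
the tag does not change the result, which is the point made above about
ordering.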

> but for now we just have "i-", deprecated but still valid,

"i-" is not deprecated.

> for all other letters there's no parsing defined for now, their syntax
> is unknown and they are not interchangeable without a standard, so
> they are used only for private use

Extension 't' was defined in 2011 and 'u' in 2010. They have
well-defined syntax, specified in RFC 6497 and 6067 respectively.

Undefined singletons may not be used for private use.

> some BCP47 use an empty first subtag, i.e. the tag starts by an
> hyphen;

Absolutely, utterly false.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




Re: [OT] RE: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Philippe Verdy
2015-05-18 23:55 GMT+02:00 Doug Ewell :

> Philippe Verdy  wrote:
>
> > If ever the country codes used in BCP47 become full (all pairs of
> > letters used), just some time before this happens, we could see new
> > prefixes added before a new range of code. It is possible to use a
> > 1-letter prefix for new country/territory code extensions, but with
> > some maintenance of BCP47 parsing rules (notably the letter used
> > should not be reordered with other singleton prefixes)
>
> This would be a major revision to BCP 47, it would have nothing to do
> with reordering,


It would have to do because all subtags after the primary language subtag
in BCP47 are optional, and you can distinguish them only by their length
*or* by the role assigned to specific singletons: there's already the "x"
singleton exception (that is ordered at end), but other singletons are
currently described to use a canonical order but it is used only for
encoding variants unrelated to region subtags or even to the languages.

Very few singletons are used in fact (the singleton subtags occurring at
the start of the tag are also treated separately from others: it could also
be used to support new syntaxes for BCP47 tags, but for now we just have
"i-", deprecated but still valid, and "x-" for private use; for all other
letters there's no parsing defined for now, their syntax is unknown and they
are not interchangeable without a standard, so they are used only for
private use; another constraint comes from the length limit of subtags: the
first subtag is either a special singleton, or a primary language code using
2 or 3 letters for now; some BCP47 use an empty first subtag, i.e. the tag
starts with a hyphen; double hyphens could be used as extensions to change
locally the parsing rules and possibly return to the next logical subtag
and could be used to encode international organizations without needing a
formal "exceptional reservation" in ISO 3166-1; for example "*-EU" could
have been encoded as "--O-EU" and we could have the same system for NATO,
EEA, EFTA... There's still ample space for extensions of parsing rules in
BCP47, but not in ISO3166.)

ISO 3166 also encodes some 4-letter codes but they are not used in BCP47
(so there's no confusion with 4-letter script codes).


Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Philippe Verdy
2015-05-18 23:38 GMT+02:00 Doug Ewell :

> Philippe Verdy  wrote:
>
> > So country codes cannot be reassigned (and we can expect many more
> > merges/splits or changes of regimes in the many troubled areas of the
> > world.
>
> Changes of regimes don't usually result in new 3166 code elements. The
> same is true for merges (look at DE/DD or YE/YD). New and changed
> country names usually do.


I just included merges only to be complete because they frequently occur a
little time after a split (and not with the former part).

But of course merges are much less frequent than splits. And in today's
globalized world, splits are even easier than they were in the past (where
merges were the results of invasions/wars/conquests).

The rate of splits is in fact accelerating in history, even in countries
living in peace, this does not mean that they terminate all their
partnerships, just that they take the right to create their own alliances.
There are reasons for them: cultural (language), national taxes, economic
difficulties in some regions, unemployment, management of resources (water,
constructible or cultivable soils) but the most important reason is
political (defiance between political parties, or brutality against
minorities and mutual misunderstanding)...

In the last 50 years the most important changes came from decolonisation
and the resulting independences (completed at the end of the 1970's). But now
we are seeing splits for much smaller entities, and this can occur in many
more places.

With ISO 3166-2 the situation within countries is much more complex and
changes are more frequent (in Europe most countries are undergoing large
changes in their administrative divisions; the changes that will occur next
year in French regions are still not taken into account in ISO 3166-2, nor
is the change that is already effective within one department, split in two
parts with only one which remains as a department, the other one being a
group of communes erected into a new territorial collectivity taking all
powers of its former department, for local administration only, but with
the national power still not divided in what is now a "circonscription
départementale" with the same departmental prefecture as before the split).

The hierarchical model of subdivisions has in fact lots of exceptions (look
into Spain, UK, Germany, it was already true for France and US, but now it
is also occurring even in the Metropolitan area). In fact we can see several
parallel layers of subdivisions, but for different legal roles/missions.

The ISO 3166-1 also assumes that everything is a country, but it is already
wrong with some dependent territories (not all) of France, UK, US, the
Netherlands, Spain and possibly some islands of China. And these codes also
don't map correctly to effective national divisions (the encoding for
claims in Antarctica remains ambiguous, depending on who uses the data).
There are also reserves for things that are not countries but groups of
countries (EU, WIPO areas...), and there could exist new codes for other
international alliances (these look like "merges" except that they are not
full merges and the entities continue to coexist separately).


[OT] RE: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Doug Ewell
Philippe Verdy  wrote:

> If ever the country codes used in BCP47 become full (all pairs of
> letters used), just some time before this happens, we could see new
> prefixes added before a new range of code. It is possible to use a
> 1-letter prefix for new country/territory code extensions, but with
> some maintenance of BCP47 parsing rules (notably the letter used
> should not be reordered with other singleton prefixes)

This would be a major revision to BCP 47, it would have nothing to do
with reordering, and it would not in any case involve 1-letter prefixes,
which already have a different meaning. And the time frame we are
talking about is reminiscent of Ken's estimate of when 17 planes will no
longer be enough for Unicode.

> But I feel it will first be simpler to assign a special 2-letter code
> like "C1-" followed by a new series of 2-letter country codes

We actually thought about this stuff over in LTRU. Really.

I'm not the least bit concerned about the DNS. Five years from now they
could be assigning TLDs consisting entirely of emoji.

This is no longer relevant to flag tags or anything else Unicode.
 
--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




RE: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Doug Ewell
Philippe Verdy  wrote:

>> ISO 3166-1 already defines alpha-3 and numeric code elements, as well
>> as alpha-2.
>
> But how to work with the 2 letters limitation when the world wants
> more stability in codes (this was an important reason why ISO 639 was
> not fully integrated in IETF tags, and why the IETF tags have chosen
> the stability by keeping also the codes that have been deleted in ISO
> 639, but only deprecated in IETF language tags (BCP47)).

I assume you're aware of the extent of my involvement in BCP 47, so this
is a semi-rhetorical question.

If and when ISO 3166/MA manages to use up all of the remaining 336
unassigned code elements -- nearly half of the TOTAL possible code space
of 676 two-letter combinations -- the corresponding numeric code
elements will be assigned as BCP 47 region subtags instead.

> We've already seen the famous reuse before 50 years (do you remember
> when CS was reassigned just a few months after it was discarded after
> an initial introduction for some months in Serbia-Montenegro?)

What actually happened was, 'CS' was withdrawn for Czechoslovakia and
then assigned to Serbia and Montenegro. At that time, the waiting period
was five years; the 'CS' incident is what resulted in the change to 50
years.

> But now let's remember that parts of ISO 3166 are also included (not
> fully) in BCP47 tags that require the stability. It will prohibit
> reassignments by ISO (or if this happens, this will break BCP47 and the
> IETF will reject the change and will use another subtag if needed).

Again, I'm guessing you already know that I know how BCP 47 works.

ISO 3166/MA can recycle alpha-2 code elements 50 years after withdrawal
if they feel like it. BCP 47 can't prevent that. That's why BCP 47 has a
mechanism to work around that possibility.

> So country codes cannot be reassigned (and we can expect many more
> merges/splits or changes of regimes in the many troubled areas of the
> world.

Changes of regimes don't usually result in new 3166 code elements. The
same is true for merges (look at DE/DD or YE/YD). New and changed
country names usually do.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸



Re: Regexes, Canonical Equivalence and Backtracking of Input

2015-05-18 Thread Richard Wordingham
On Mon, 18 May 2015 22:56:47 +0200
Philippe Verdy  wrote:

> Isn't it possible for your basic substitution to transform \u0F73
> into a character class [\u0F71\u0F72\u0F73] that the regexp considers
> as a single entity to check?
> In that case, backtracking for matching \u0F73*\u0F72 is simpler:
>  [\u0F71\u0F72\u0F73]*\u0F72, as it just requires backtracking only
> one character class (instead of one character).

I'm still waiting for your explanation of how your scheme for European
diacritics (as used in SE Asia) would work.  This thread is intended for
the idea of using the regex to decide which character to take as the
next character from the input trace.  In the other thread, I'm still not
sure whether you're working with traces or strings.
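The idea of letting the regex pick its next character out of a run of
non-starters can be sketched as below. This is a toy illustration under
stated assumptions, not the engine under discussion: takeable and
match_star are hypothetical names, the pattern is restricted to a starred
sequence of literal marks, and a mark is treated as available only when no
earlier remaining mark in the run has the same canonical combining class.

```python
import unicodedata

def takeable(run):
    """Indices of the marks that may be taken next from a run of non-starters."""
    seen, indices = set(), []
    for i, char in enumerate(run):
        ccc = unicodedata.combining(char)
        if ccc not in seen:  # only the first mark of each class can move up
            seen.add(ccc)
            indices.append(i)
    return indices

def match_star(run, unit):
    """Match unit* against a run of non-starters; return (repeats, leftover)."""
    remaining = list(run)
    repeats = 0
    while unit:
        trial = list(remaining)
        for wanted in unit:
            hit = next((i for i in takeable(trial) if trial[i] == wanted), None)
            if hit is None:
                # Partial repeat failed: dropping the trial copy puts back
                # the marks taken during this attempt.
                return repeats, remaining
            del trial[hit]
        remaining, repeats = trial, repeats + 1
    return 0, remaining

# Input marks 0F71 0F73 decompose to the run 0F71 0F71 0F72; the pattern
# \u0F73* decomposes to (0F71 0F72)*.
run = list(unicodedata.normalize("NFD", "\u0F71\u0F73"))
unit = unicodedata.normalize("NFD", "\u0F73")
print(match_star(run, unit))
```

On this input the sketch takes one 0F71 and the 0F72, then tries a second
repeat, fails to find another 0F72, and gives the second 0F71 back.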

Richard.


Re: Regexes, Canonical Equivalence and Backtracking of Input

2015-05-18 Thread Philippe Verdy
Isn't it possible for your basic substitution to transform \u0F73 into a
character class [\u0F71\u0F72\u0F73] that the regexp considers as a single
entity to check?
In that case, backtracking for matching \u0F73*\u0F72 is simpler:
 [\u0F71\u0F72\u0F73]*\u0F72, as it just requires backtracking only one
character class (instead of one character).

It is also possible to transform \u0F73*\u0F72 into the really
equivalent: (\u0F71\u0F72)*\u0F72 | (\u0F72\u0F71)*\u0F72 | (\u0F73)*\u0F72
(assuming that in the non-capturing group you are already performing
canonical reorderings using counters: as many counters as there are
distinct ccc values in these groups, excluding blockers that create groups
always matched separately without any need to backtrack "through" them;
if this does not match at a blocking position, there's no other
alternative possible, so this is a definitive non-match).


2015-05-18 22:32 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> On Mon, 18 May 2015 21:05:49 +0200
> Philippe Verdy  wrote:
>
> > 2015-05-18 20:35 GMT+02:00 Richard Wordingham <
> > richard.wording...@ntlworld.com>:
> >
> > > The algorithm itself should be tractable - Mark Davis has published
> > > an algorithm to generate all strings canonically equivalent to a
> > > Unicode string, and what we need might not be so complex.
> >
> >
> > Even this algorithm from Mark Davis will fail in this case:
>
> How so?  The regexp is \u0F73*, which is converted to a non-capturing
> (\u0F71\u0F72)*.
>
> Given a string 0F40 0F71 0F73 0F42 representing the trace, matching
> will fail at 0F40 and an attempt will be made starting at the 0F71.
> The input string handling part will then present a run of three
> non-starters:
>
> \u0F71 \u0F71 \u0F72
>
> I think the process is even simpler than I first thought.
>
> The engine will look for a match for \u0F71, and take it from this
> list, leaving \u0F71 \u0F72.
>
> It will then look for a match for \u0F72, and take it from the list,
> leaving \u0F71.
>
> It will then look for a match for \u0F71, and take it from the list.
>
> It will then look for a match for \u0F72.  It will fail, and then back
> track, disgorging the \u0F71.
>
> The input 'stream' now looks like \u0F71 \u0F42.  This will match
> nothing; it is after the matching substream.
>
> The matching substring is:
>
> None of 0F40, all of 0F71, the second part of 0F72 and none of 0F42.
>
> Its value, as a trace, is 0F71 0F72.
>
> > - You can use it easily to transform a regexp containing (\u0F73)
> > into a regexp containing  (\u0F73|\u0F71\u0F72|\u0F71\u0F72)
>
> That is *not* what I am suggesting.  The regex needs decomposing, but
> no other transformations.  It is the string representing the input
> trace that is expanded.
>
> > - But this leaves the same problem for unbounded repetitions with
> > the "+" or "*" or "{m,}" operators.
>
> Not at all - that is the beauty of the scheme.  On the regex
> side, \u0F73* is as straightforward as non-capturing (\u0061\u0062)*.
> Putting back the unused fragments of the run of non-starters in the
> input trace is the most difficult part.
>
> > Now all the problem is how to do the backtracking,
>
> Yes, that may be more difficult than I thought.  Comparing against
> literal characters is simple, but it may be more complicated when
> matching against a list of alternative characters.  Back-tracking
> schemes may not be set up to try the next character on a list of
> alternatives, e.g. so that pattern (\u0f72|\u0f71)\u0f72 matches input
> string 0F71 0F72.  The alternative (\u0f72|\u0f71) would first take the
> 0F72, and only on backtracking would it take the 0F71 instead.  This is
> an issue with traces that does not appear with strings.
>
> Richard.
>


Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Philippe Verdy
If ever the country codes used in BCP47 become full (all pairs of letters
used), just some time before this happens, we could see new prefixes added
before a new range of codes. It is possible to use a 1-letter prefix for new
country/territory code extensions, but with some maintenance of BCP47
parsing rules (notably the letter used should not be reordered with other
singleton prefixes).

But I feel it will first be simpler to assign a special 2-letter code like
"C1-" followed by a new series of 2-letter country codes (ccTLDs will
survive; in fact with the development of new gTLDs not limited to 2
characters, the new countries will prefer asking for a more descriptive
gTLD, even if they don't have a 2-letter ccTLD).

Or 2-letter codes will be deprecated in favor of 3-letter codes (but the
IETF will keep all the existing 2-letter ccTLDs as long as their sponsors
support them and don't require changing them to another TLD, even if this
breaks existing URLs encoded throughout the web).

There's no requirement for ISO 3166 codes to match exactly with a TLD in
the global DNS (this has long been the case for the ".uk" ccTLD,
because ".gb" is almost unused). But the stability of country codes is
desirable as well in URLs (stored within encoded documents, for which
it will be hard to make global substitutions). The solution could be to use
tracking dates to resolve domain names, but the worldwide DNS currently
does not support this type of query by date, registrars would not like
to have to keep history files for long, and software/OS developers don't
want to include and maintain such data for their domain name resolving
clients.

It is however possible that in some future the existing URLs requiring
domain names will be deprecated in favor of unique IDs (e.g. based on
IPv6): users won't see domain names, but labels retrieved from some
whois-like database, or shown by search engines and possibly translated. It
would be also an improvement even if this breaks the business of existing
registrars (however registrars will still have business for selling
PKI-related services). These IDs can also be used in URIs. In fact the DNS
system is already antique in its design (and its very strange and complex
encoding for IDNA that no one can read).


2015-05-18 22:10 GMT+02:00 Doug Ewell :

> Markus Scherer  wrote:
>
> > As far as I can tell from your quotes, CLDR will say what's valid
> > (plus containment info), and Unicode permits you to show a flag for
> > any valid tag. North Lanarkshire seems perfectly fine.
>
> I'm under the impression that this will be a standard Unicode mechanism,
> defined in principle by TUS and in detail by the upcoming revision of
> UTR #51, with data (but no additional rules) supplied by CLDR.
>
> > I am curious to see if the redundant hyphen will be part of the
> > syntax.
>
> Like Philippe, I don't believe the hyphen is "redundant." ISO 3166-2
> requires it (Section 5.2), and the syntax diagram at the end of
> L2/15-145R shows it:
>
> B ((TL{2} (TH (TL|TD){3})?) | (TD{3}))
>
> where TH is TAG HYPHEN-MINUS.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>
>
>


Re: Regexes, Canonical Equivalence and Backtracking of Input

2015-05-18 Thread Richard Wordingham
On Mon, 18 May 2015 22:40:21 +0300
Eli Zaretskii  wrote:

> > Date: Mon, 18 May 2015 19:35:45 +0100
> > From: Richard Wordingham 
> > 
> > Mark Davis has published an algorithm to generate all strings
> > canonically equivalent to a Unicode string
> 
> Where can I find the description of that algorithm?

Section 5 of http://unicode.org/notes/tn5/ .  There's a lot of detail
missing, and it's easy to overlook the Hangul syllables.  The complete
code is rather more complicated than it looks from the wording,
especially if you want successive candidates on successive calls.  You
also need to include the legal permutations of the non-starters - the
code as given only delivers the FCD canonical equivalents.
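
As a rough illustration of what "legal permutations of the non-starters"
means, here is a brute-force sketch. It is not the TN5 algorithm, and
legal_permutations is a hypothetical helper name. It keeps every
reordering of a short run whose NFD form is unchanged, and it is only
practical for short runs because it tries every permutation.

```python
import itertools
import unicodedata


def legal_permutations(run):
    """Return every reordering of `run` that is canonically equivalent to it.

    Brute force: try all permutations of the NFD form and keep those that
    normalize back to the same NFD string.
    """
    target = unicodedata.normalize("NFD", run)
    found = set()
    for perm in itertools.permutations(target):
        candidate = "".join(perm)
        if unicodedata.normalize("NFD", candidate) == target:
            found.add(candidate)
    return sorted(found)


# U+0F71 and U+0F72 have different non-zero combining classes, so they
# commute and every distinct ordering of the run is legal.
for s in legal_permutations("\u0f71\u0f71\u0f72"):
    print(" ".join(f"{ord(c):04X}" for c in s))
```

A starter blocks reordering, so a run such as "a" followed by U+0301 has
only one legal ordering.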

On further thought, I also think it's actually unnecessary for this
application.

Richard.


Re: Regexes, Canonical Equivalence and Backtracking of Input

2015-05-18 Thread Richard Wordingham
On Mon, 18 May 2015 21:05:49 +0200
Philippe Verdy  wrote:

> 2015-05-18 20:35 GMT+02:00 Richard Wordingham <
> richard.wording...@ntlworld.com>:
> 
> > The algorithm itself should be tractable - Mark Davis has published
> > an algorithm to generate all strings canonically equivalent to a
> > Unicode string, and what we need might not be so complex.
> 
> 
> Even this algorithm from Mark Davis will fail in this case:

How so?  The regexp is \u0F73*, which is converted to a non-capturing
(\u0F71\u0F72)*.

Given a string 0F40 0F71 0F73 0F42 representing the trace, matching
will fail at 0F40 and an attempt will be made starting at the 0F71.
The input string handling part will then present a run of three
non-starters:

\u0F71 \u0F71 \u0F72

I think the process is even simpler than I first thought.

The engine will look for a match for \u0F71, and take it from this
list, leaving \u0F71 \u0F72.

It will then look for a match for \u0F72, and take it from the list,
leaving \u0F71.

It will then look for a match for \u0F71, and take it from the list.

It will then look for a match for \u0F72.  It will fail, and then
backtrack, disgorging the \u0F71.

The input 'stream' now looks like \u0F71 \u0F42.  This will match
nothing; it is after the matching substream. 

The matching substring is:

None of 0F40, all of 0F71, the second part of 0F73 and none of 0F42.

Its value, as a trace, is 0F71 0F72.
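
The walk-through above can be sketched in a few lines. This is a minimal
sketch under stated assumptions, not a real regex engine: take and
match_star are hypothetical names, the pattern is a single literal
sequence under a greedy star, and a wanted character may hop over an
earlier unconsumed character only if both are non-starters with
different combining classes.

```python
import unicodedata


def take(remaining, wanted):
    """Remove one `wanted` from `remaining`, or return None if unreachable.

    `wanted` may be pulled forward past a character only when both have
    non-zero, different canonical combining classes (they commute).
    """
    ccc_wanted = unicodedata.combining(wanted)
    for i, ch in enumerate(remaining):
        if ch == wanted:
            return remaining[:i] + remaining[i + 1:]
        ccc = unicodedata.combining(ch)
        if ccc == 0 or ccc_wanted == 0 or ccc == ccc_wanted:
            return None  # blocked: cannot hop over this character
    return None


def match_star(unit, trace):
    """Greedily match unit* against trace; return (iterations, leftover)."""
    if not unit:
        return 0, trace
    remaining, count = trace, 0
    while True:
        attempt = remaining
        for wanted in unit:
            attempt = take(attempt, wanted)
            if attempt is None:
                break
        else:
            remaining, count = attempt, count + 1
            continue
        # Failed part-way: discard `attempt`, which puts back ("disgorges")
        # whatever the failed iteration had consumed.
        return count, remaining


count, leftover = match_star("\u0f71\u0f72", "\u0f71\u0f71\u0f72\u0f42")
print(count, " ".join(f"{ord(c):04X}" for c in leftover))
```

Keeping the input as an immutable string makes the put-back step trivial:
a failed iteration simply drops its working copy.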

> - You can use it easily to transform a regexp containing (\u0F73)
> into a regexp containing  (\u0F73|\u0F71\u0F72|\u0F71\u0F72)

That is *not* what I am suggesting.  The regex needs decomposing, but
no other transformations.  It is the string representing the input
trace that is expanded.

> - But this leaves the same problem for unbounded repetititions with
> the "+" or "*" or "{m,}" operators.

Not at all - that is the beauty of the scheme.  On the regex
side, \u0F73* is as straightforward as non-capturing (\u0061\u0062)*.
Putting back the unused fragments of the run of non-starters in the
input trace is the most difficult part.

> Now all the problem is how to do the backtracking,

Yes, that may be more difficult than I thought.  Comparing against
literal characters is simple, but it may be more complicated when
matching against a list of alternative characters.  Back-tracking
schemes may not be set up to try the next character on a list of
alternatives, e.g. so that pattern (\u0f72|\u0f71)\u0f72 matches input
string 0F71 0F72.  The alternative (\u0f72|\u0f71) would first take the
0F72, and only on backtracking would it take the 0F71 instead.  This is
an issue with traces that does not appear with strings.
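
A small sketch of that backtracking over alternatives, assuming the same
commutation rule for non-starters (different non-zero combining classes
commute). The names take and match are hypothetical, and the pattern is
modelled simply as a list of tuples of alternative characters.

```python
import unicodedata


def take(remaining, wanted):
    """Remove one `wanted` from `remaining`, or return None if unreachable."""
    ccc_wanted = unicodedata.combining(wanted)
    for i, ch in enumerate(remaining):
        if ch == wanted:
            return remaining[:i] + remaining[i + 1:]
        ccc = unicodedata.combining(ch)
        if ccc == 0 or ccc_wanted == 0 or ccc == ccc_wanted:
            return None
    return None


def match(pattern, remaining):
    """Match a list of alternative-sets against a trace.

    Returns the unconsumed input on success, or None on failure. Each
    alternative is tried in order; a later failure backtracks to the next.
    """
    if not pattern:
        return remaining
    for alternative in pattern[0]:
        rest = take(remaining, alternative)
        if rest is not None:
            result = match(pattern[1:], rest)
            if result is not None:
                return result
    return None


# (\u0f72|\u0f71)\u0f72 against 0F71 0F72: the first alternative grabs the
# 0F72 by hopping over 0F71, the rest then fails, and only the second
# alternative leads to a match.
pattern = [("\u0f72", "\u0f71"), ("\u0f72",)]
print(match(pattern, "\u0f71\u0f72") == "")
```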

Richard.


Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Philippe Verdy
2015-05-18 22:14 GMT+02:00 Doug Ewell :

> I know I'll regret this...
>
You should not

>
> Philippe Verdy  wrote:
>
> > Sometime in a future, two letters will not be enough even in ISO
> > 3166-1, if countries continue to split/merge (this does not happen
> > frequently but is occurs every few years; and it will not be possible
> > to reuse old codes that are maintained for a long period).
>
> ISO 3166-1 already defines alpha-3 and numeric code elements, as well as
> alpha-2.
>

But how to work with the 2-letter limitation when the world wants more
stability in codes? (This was an important reason why ISO 639 was not fully
integrated in IETF tags, and why the IETF tags have chosen stability by
also keeping the codes that have been deleted in ISO 639, which are only
deprecated in IETF language tags (BCP47).)

We've already seen the famous reuse before 50 years (do you remember when
CS was reassigned just a few months after it was discarded after an initial
introduction for some months in Serbia-Montenegro?)

ISO coding standards are known to be unstable. This would also be true of
the UCS if Unicode did not push its stability pact with ISO!

But now let's remember that parts of ISO 3166 are also included (not
fully) in BCP47 tags, which require stability. This will prohibit
reassignments by ISO (or if this happens, it will break BCP47 and the IETF
will reject the change and will use another subtag if needed).

So country codes cannot be reassigned (and we can expect many more
merges/splits or changes of regimes in the many troubled areas of the world).


RE: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Doug Ewell
I know I'll regret this...

Philippe Verdy  wrote:

> Sometime in a future, two letters will not be enough even in ISO
> 3166-1, if countries continue to split/merge (this does not happen
> frequently but is occurs every few years; and it will not be possible
> to reuse old codes that are maintained for a long period).

ISO 3166-1 already defines alpha-3 and numeric code elements, as well as
alpha-2.

ISO 3166/MA has added approximately one code element per year on average
since the breakup of the Soviet Union. There are approximately 336
unassigned alpha-2 code elements, and if any of the assigned ones is
withdrawn, it can be recycled in 50 years.

> May be then we'll have ISO 3166-1 codes using digits (such as "A1" or
> "1A"), but this will cause some problems to map them to IETF ccTLD
> codes (within the DNS root registry).

Adapting to this challenge, if and when it arises, should be child's
play for the DNS, which has recently introduced TLDs like
".சிங்கப்பூர்" (or ".xn--clchc0ea0b2g2a9gcd" if
one prefers).

> As well the UN M.49 numeric codes will get full if it continues with
> its current allocation scheme (using ranges of numbers by continental
> regions). Or the other solution will be to extend the set of allowed
> letters.

UN M.49 numeric code elements (equivalent to ISO 3166-1) are assigned
alphabetically by English country name, or as close as possible, with
some exceptions related to historical names. There are no allocations by
geographical region.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




RE: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Doug Ewell
Markus Scherer  wrote:

> As far as I can tell from your quotes, CLDR will say what's valid
> (plus containment info), and Unicode permits you to show a flag for
> any valid tag. North Lanarkshire seems perfectly fine.

I'm under the impression that this will be a standard Unicode mechanism,
defined in principle by TUS and in detail by the upcoming revision of
UTR #51, with data (but no additional rules) supplied by CLDR.

> I am curious to see if the redundant hyphen will be part of the
> syntax.

Like Philippe, I don't believe the hyphen is "redundant." ISO 3166-2
requires it (Section 5.2), and the syntax diagram at the end of
L2/15-145R shows it:

B ((TL{2} (TH (TL|TD){3})?) | (TD{3}))

where TH is TAG HYPHEN-MINUS.
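
For illustration only, the syntax diagram above can be transcribed into
a regular expression. The code point ranges are assumptions read off the
examples in this thread: B is taken to be U+1F3F3, TL the tag capital
letters U+E0041..U+E005A, TD the tag digits U+E0030..U+E0039, and TH
U+E002D. The helper names are hypothetical.

```python
import re

B = "\U0001F3F3"                    # assumed base character
TL = "[\U000E0041-\U000E005A]"      # assumed: tag Latin capital letters
TD = "[\U000E0030-\U000E0039]"      # assumed: tag digits
TH = "\U000E002D"                   # TAG HYPHEN-MINUS

# B ((TL{2} (TH (TL|TD){3})?) | (TD{3}))
FLAG_TAG = re.compile(
    f"{B}(?:{TL}{{2}}(?:{TH}(?:{TL}|{TD}){{3}})?|{TD}{{3}})"
)


def to_tags(text):
    """Map printable ASCII to the tag characters at U+E0000 + code point."""
    return "".join(chr(0xE0000 + ord(c)) for c in text)


def is_flag_tag(sequence):
    """True if the whole sequence fits the diagram as transcribed here."""
    return FLAG_TAG.fullmatch(sequence) is not None


for code in ("GB", "GB-SCT", "150", "GBSCT"):
    print(code, is_flag_tag(B + to_tags(code)))
```

With the hyphen in the grammar, "GBSCT" is rejected while "GB-SCT" is
accepted; this only checks syntax and says nothing about whether a flag
exists for the code.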

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




Re: Regexes, Canonical Equivalence and Backtracking of Input

2015-05-18 Thread Eli Zaretskii
> Date: Mon, 18 May 2015 19:35:45 +0100
> From: Richard Wordingham 
> 
> Mark Davis has published an algorithm to generate all strings
> canonically equivalent to a Unicode string

Where can I find the description of that algorithm?


Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Richard Wordingham
On Mon, 18 May 2015 19:37:06 +0100
Andrew West  wrote:

> > <1F3F3 E0047 E0042 E002D E004E E004C E004B> (GB-NLK)
> > for the North Lanarkshire council area flag
> 
> I don't believe that North Lanarkshire has an associated flag, which I
> think is the case for most UK counties and councils (Cornwall, Devon
> and Dorset all have flags, but they may be the exceptions).  In fact
> not all of the four nations comprising the UK have a flag -- for
> political reasons there is no official flag for Northern Ireland, so I
> do not know what an implementation would display for <1F3F3 E0047
> E0042 E002D E004E E0049 E0052> (GB-NIR), perhaps just a plain flag
> emblazoned with "GB-NIR".

As the Ulster Banner is still in use, and still does unofficially
represent Northern Ireland, perhaps it should have its own codepoint.

I'm not sure of the strength of the argument for St Patrick's Cross.
Perhaps it too should have its own codepoint, especially if it is
evolving from being a flag of Ireland (apparently not used by the Irish
rugby union team) to a flag of Northern Ireland.

Richard.


Re: Regexes, Canonical Equivalence and Backtracking of Input

2015-05-18 Thread Philippe Verdy
2015-05-18 20:35 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> The algorithm itself should be tractable - Mark Davis has published
> an algorithm to generate all strings canonically equivalent to a
> Unicode string, and what we need might not be so complex.


Even this algorithm from Mark Davis will fail in this case:

- You can use it easily to transform a regexp containing (\u0F73) into a
regexp containing  (\u0F73|\u0F71\u0F72|\u0F71\u0F72)

- But this leaves the same problem for unbounded repetitions with the "+"
or "*" or "{m,}" operators.

- However you can use it for bounded repetitions with "{m,n}", provided
that "n" is not too large, because the total number of expanded
alternatives (without repetitions) explodes exponentially with a power
proportional to "n" (the base of the exponent depends on the basic
non-repeated string and the number of canonical equivalents it has).
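
To give a feel for the growth described in the last point, here is a
back-of-the-envelope count. It is a naive model with a hypothetical
function name: it assumes each repetition independently picks one of k
spellings and ignores interleaving between adjacent repetitions, which
would add further alternatives.

```python
def expanded_alternatives(k, m, n):
    """Naive count of alternatives when X{m,n} is expanded and X has k spellings.

    For each repetition count i in m..n there are k**i ways to choose a
    spelling per repetition.
    """
    return sum(k ** i for i in range(m, n + 1))


for n in (1, 5, 10, 20):
    print(n, expanded_alternatives(3, 1, n))
```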

Now all the problem is how to do the backtracking, and if it works, and how
to expose the matched captures (which will still be discontiguous,
including $0) and then how you can perform a safe find&replace operation:
it is hard to specify the replacement with simple "$n" placeholders, you
need more complex placeholders for handling discontiguous matches:

$n has to become not just a string, but an object whose default "tostring"
property is the exact content of the match, but other properties are needed
to expose the interleaving characters, or some context before and after the
match (notably when these contexts contain combining characters that are
NOT blocked by the match itself).

Backtracking is an internal thing that happens before matches are even
handled: it occurs where there is still NO match to return, so even if the
regexp engine offers a way to use a callback instead of a basic replacement
string containing "$n" placeholders, this callback would not be called.


Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Philippe Verdy
The hyphen is not redundant in ISO 3166, which defines primary codes with
variable length (even if ISO 3166 part 1 for now only uses two-letter codes).
Sometime in the future, two letters will not be enough even in ISO 3166-1, if
countries continue to split/merge (this does not happen frequently but it
occurs every few years; and it will not be possible to reuse old codes that
are maintained for a long period). Maybe then we'll have ISO 3166-1 codes
using digits (such as "A1" or "1A"), but this will cause some problems to
map them to IETF ccTLD codes (within the DNS root registry).
As well, the UN M.49 numeric codes will get full if it continues with its
current allocation scheme (using ranges of numbers by continental regions).
Or the other solution will be to extend the set of allowed letters.

2015-05-18 20:28 GMT+02:00 Markus Scherer :

> On Mon, May 18, 2015 at 11:19 AM, Doug Ewell  wrote:
>
>> Is the new mechanism intended to allow flag tags that include either
>> "subtype" values or "contains" values?
>
>
> As far as I can tell from your quotes, CLDR will say what's valid (plus
> containment info), and Unicode permits you to show a flag for any valid tag.
> North Lanarkshire seems perfectly fine.
>
> I am curious to see if the redundant hyphen will be part of the syntax.
>
> markus
>


Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Andrew West
On 18 May 2015 at 19:19, Doug Ewell  wrote:
>
> Is the new mechanism intended to allow flag tags that include either
> "subtype" values or "contains" values? For example:

That is my understanding.

> <1F3F3 E0047 E0042 E002D E0053 E0043 E0054> (GB-SCT)
> for the Scottish flag
>
> and
>
> <1F3F3 E0047 E0042 E002D E004E E004C E004B> (GB-NLK)
> for the North Lanarkshire council area flag

I don't believe that North Lanarkshire has an associated flag, which I
think is the case for most UK counties and councils (Cornwall, Devon
and Dorset all have flags, but they may be the exceptions).  In fact
not all of the four nations comprising the UK have a flag -- for
political reasons there is no official flag for Northern Ireland, so I
do not know what an implementation would display for <1F3F3 E0047
E0042 E002D E004E E0049 E0052> (GB-NIR), perhaps just a plain flag
emblazoned with "GB-NIR".
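
For readers who want to check which code a given sequence spells, here
is a small sketch that reads the angle-bracket notation used in this
thread. It assumes, as the examples suggest, that each tag character is
the ASCII character's code point plus 0xE0000 and that the base is
U+1F3F3; decode_flag_tag is a hypothetical name.

```python
def decode_flag_tag(notation):
    """Turn '<1F3F3 E0047 ...>' into the plain code it spells."""
    code_points = [int(h, 16) for h in notation.strip("<> \n").split()]
    if not code_points or code_points[0] != 0x1F3F3:
        raise ValueError("expected U+1F3F3 as the base character")
    chars = []
    for cp in code_points[1:]:
        if not 0xE0020 <= cp <= 0xE007E:
            raise ValueError(f"U+{cp:04X} is not a tag character")
        chars.append(chr(cp - 0xE0000))  # tag character -> ASCII
    return "".join(chars)


print(decode_flag_tag("<1F3F3 E0047 E0042 E002D E004E E0049 E0052>"))
```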

Andrew


Regexes, Canonical Equivalence and Backtracking of Input

2015-05-18 Thread Richard Wordingham
Philippe and I have got bogged down in a long discussion of how to
parse traces of Unicode strings under canonical equivalence against
non-regular Kleene star of regular expressions.  Fortunately, such
expressions can be expected to have very little use.  A seemingly simple
example is the regex \u0f73* i.e. any number of occurrences of U+0F73
TIBETAN VOWEL SIGN II, and not \u0f71\u0f72*. An example of a string
matching under canonical equivalence is 0F71 0F71 0F72 0F72.
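
The equivalence claimed for this example can be checked with Python's
standard unicodedata module. This is only a quick sanity check of the
example, not part of any proposed matching scheme; hexes is a
hypothetical helper for printing.

```python
import unicodedata

pattern_char = "\u0f73"                    # TIBETAN VOWEL SIGN II
candidate = "\u0f71\u0f71\u0f72\u0f72"     # the string from the example


def hexes(s):
    """Format a string as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


# U+0F73 decomposes canonically into two non-starters.
print(hexes(unicodedata.normalize("NFD", pattern_char)))

# Their combining classes differ, so canonical reordering can swap them.
print([unicodedata.combining(c) for c in "\u0f71\u0f72"])

# Two occurrences of U+0F73 normalize to the candidate string.
print(unicodedata.normalize("NFD", pattern_char * 2) == candidate)
```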

I believe we both thought that characters would arrive from the trace
in a deterministic order.  Now, many regular expression engines
back-track their parsing of the input string (no-one has reported
working with input traces).  A possibly useful trick would be for
characters to be taken from the input file in accordance with the
matching to the pattern, with input also back-tracked if matching
fails.  The notion of next character would depend on the state of the
parsing algorithm.

In the example above, the engine would just take the input in the
order 0F71 0F72 0F71 0F72.  Match found, job done.

One advantage of this scheme is that there would be no need for
adjustments to deal with the interleaving of adjacent matches to
successive subexpressions.  There would be no nagging worry that
one's rational expression was not a regular expression when applied
to traces.

Any theoreticians around may be wondering how this magic is achieved.
The simple answer is that the non-finiteness has been transferred to:

(1) the back-tracking through parse options; and
(2) the algorithm to walk through the character sequencing options.

The algorithm itself should be tractable - Mark Davis has published
an algorithm to generate all strings canonically equivalent to a
Unicode string, and what we need might not be so complex.

I offer this thought up as it seems that, for a regex engine working on
traces with deterministic input, the byte code for a regex
concatenation AB or iteration A* is much more complicated than the code
for the subregexes A and B.  I have a worry that the length of the
compiled code might even be exponential with the length of the regex.
(I may be wrong - there might be a limit to what one can do for worst
case complexity of the interleaving.)  Choosing the input to match the
regex would remove this problem.

Richard.


Re: Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Markus Scherer
On Mon, May 18, 2015 at 11:19 AM, Doug Ewell  wrote:

> Is the new mechanism intended to allow flag tags that include either
> "subtype" values or "contains" values?


As far as I can tell from your quotes, CLDR will say what's valid (plus
containment info), and Unicode permits you to show a flag for any valid tag.
North Lanarkshire seems perfectly fine.

I am curious to see if the redundant hyphen will be part of the syntax.

markus


Flag tags with U+1F3F3 and subtypes

2015-05-18 Thread Doug Ewell
L2/15-145R says:

> In CLDR 28, LDML will define a unicode_subdivision_subtag which also
> provides validity criteria for the codes used for regional
> subdivisions (see CLDR ticket #8423). When representing regional
> subdivisions using ISO 3166-2 codes, only those codes that are valid
> for the LDML unicode_subdivision_subtag should be used.

The preliminary subdivisions.xml file includes entries like this:


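Is the new mechanism intended to allow flag tags that include either
"subtype" values or "contains" values? For example:

<1F3F3 E0047 E0042 E002D E0053 E0043 E0054> (GB-SCT)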
for the Scottish flag

and

<1F3F3 E0047 E0042 E002D E004E E004C E004B> (GB-NLK)
for the North Lanarkshire council area flag
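
The sequences in this notation can be generated mechanically. The sketch
below assumes the mapping implied by the examples (base U+1F3F3, then
each ASCII character shifted up by 0xE0000); flag_tag and notation are
hypothetical names, and nothing here checks whether a code is valid.

```python
def flag_tag(code, base=0x1F3F3):
    """Return the code points of a flag tag spelling `code`, e.g. 'GB-NLK'."""
    points = [base]
    for ch in code:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"{ch!r} has no tag character counterpart")
        points.append(0xE0000 + ord(ch))  # ASCII -> tag character
    return points


def notation(points):
    """Format code points in the angle-bracket style used in this thread."""
    return "<" + " ".join(f"{p:04X}" for p in points) + ">"


print(notation(flag_tag("GB-NLK")))
```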

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸




Re: Arabic diacritics

2015-05-18 Thread عبد الرحمان أيمن
many thanks, this is exactly the needed information :)

respectfully

2015-05-15 19:09 GMT+03:00 Denis Jacquerye :

> You should use ARABIC SHADDA U+0651 in all positions. The presentation
> forms (isolated, medial, final forms) are for compatibility with legacy
> systems.
> See what is said in http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf
> about the Arabic Presentation Forms-B.
>
> Cheers,
>
>
> On Fri, 15 May 2015 at 15:53 عبد الرحمان أيمن <
> abdo.alrhman.ai...@gmail.com> wrote:
>
>> hi,
>>
>> regarding the Arabic diacritics. e.g. for the Shadda, we
>> have:
>>
>> 1. The form that people type:
>> http://unicode-table.com/en/0651/
>>
>> 2. An Isolated form. It should be the same, but looks different in the
>> Unicode table, which is confusing me now.
>> http://unicode-table.com/en/FE7C/
>>
>> 3. A medial form:
>> http://unicode-table.com/en/FE7D/
>>
>> When do I use 1/2, and when do I use 3?
>>
>> some diacritics has e.g. isolated and medial forms. Some have
>> only one of these forms, some have both. So, where does each of them go?
>>
>> respectfully
>>
>>