Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-11-22 Thread Henri Sivonen via Unicode
> broad categories. Very approximately: > > junk ~= [[:cn:][:cs:][:co:]]+ > whitespace ~= [[:z:][:c:]-junk]+ > syntax ~= [[:s:][:p:]] // broadly speaking, including both the language > syntax & user-named operators > identifiers ~= [all-else]+ > > UAX #31 specifies sever

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-13 Thread Mark Davis ☕️ via Unicode
ace ~= [[:z:][:c:]-junk]+ - syntax ~= [[:s:][:p:]] // broadly speaking, including both the language syntax & user-named operators - identifiers ~= [all-else]+ UAX #31 specifies several different kinds of identifiers, and takes roughly that approach for http://unicode.org/repor
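The rough four-way split quoted in this snippet can be sketched with Python's unicodedata module. This is only an illustration of the "very approximately" classes as they appear in the excerpt; the function name broad_class is hypothetical, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def broad_class(ch: str) -> str:
    """Map one character to the rough classes quoted in the snippet."""
    cat = unicodedata.category(ch)
    # junk ~= [[:cn:][:cs:][:co:]]  (unassigned, surrogate, private use)
    if cat in ("Cn", "Cs", "Co"):
        return "junk"
    # whitespace ~= [[:z:][:c:] - junk]  (separators and remaining "Other")
    if cat[0] in ("Z", "C"):
        return "whitespace"
    # syntax ~= [[:s:][:p:]]  (symbols and punctuation)
    if cat[0] in ("S", "P"):
        return "syntax"
    # identifiers ~= everything else (letters, marks, numbers)
    return "identifier"


if __name__ == "__main__":
    for ch in ["a", "1", " ", "+", "_", "\ue000"]:
        print(f"U+{ord(ch):04X}", broad_class(ch))
```

Note that under this coarse split the underscore (category Pc) lands in "syntax", which is one reason UAX #31 refines the picture rather than using general categories alone.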

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-08 Thread Hans Åberg via Unicode
> On 8 Jun 2018, at 11:07, Henri Sivonen via Unicode > wrote: > > My question is: > > When designing a syntax where tokens with the user-chosen characters > can't occur next to each other without some syntax-reserved characters > between them, what advantages are there from limiting the user-

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-08 Thread Henri Sivonen via Unicode
> stance, why should a programming language, if it opts to support > non-ASCII identifiers in an otherwise ASCII core syntax, implement the > complexity of UAX #31 instead of allowing everything above ASCII in > identifiers? In other words, what problem does making a programming > languag

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Frédéric Grosshans via Unicode
.  UAX #31 also manages (I suspect unintentionally?) to give a good example of a pair of Farsi identifiers that might be awkward to tell apart in certain fonts, namely نامهای and نامه‌ای; I think those are OK in

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Asmus Freytag via Unicode
On 6/7/2018 9:01 AM, Alastair Houghton via Unicode wrote: But please don’t misunderstand; I am not — and have not been — arguing against non-ASCII identifiers. We were asked whether there were any problems. These are problems (or perhaps we might

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Alastair Houghton via Unicode
On 7 Jun 2018, at 15:51, Frédéric Grosshans via Unicode wrote: > >> IMO the major issue with non-ASCII identifiers is not a technical one, but >> rather that it runs the risk of fragmenting the developer community. >> Everyone can *type* ASCII and everyone can read

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Frédéric Grosshans via Unicode
Le 06/06/2018 à 11:29, Alastair Houghton via Unicode a écrit : On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode wrote: The Rust community is considering adding non-ascii identifiers, which follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for identifiers to

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Mark Davis ☕️ via Unicode
Got it, thanks. Mark On Thu, Jun 7, 2018 at 3:29 PM, Richard Wordingham via Unicode < unicode@unicode.org> wrote: > On Thu, 7 Jun 2018 10:42:46 +0200 > Mark Davis ☕️ via Unicode wrote: > > > > The proposal also asks for identifiers to be treated as equivalent > >

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Richard Wordingham via Unicode
On Thu, 7 Jun 2018 10:42:46 +0200 Mark Davis ☕️ via Unicode wrote: > > The proposal also asks for identifiers to be treated as equivalent > > under > NFKC. > > The guidance in #31 may not be clear. It is not to replace > identifiers as typed in by the user by thei

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Richard Wordingham via Unicode
On Thu, 7 Jun 2018 13:32:13 +0200 Joan Montané via Unicode wrote: > 2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode < > unicode@unicode.org>: > * Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT NFKC decomposes > to LATIN CAPITAL LETTER L (U+004C) MIDDLE DOT (U+00B7): > * ŀ, LATIN SMALL LE
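The decomposition mentioned in this snippet is easy to check. A minimal sketch, assuming Python's standard unicodedata module; the helper name nfkc_codepoints is made up for illustration.

```python
import unicodedata


def nfkc_codepoints(text: str) -> list[str]:
    """Return the code points of the NFKC form of text as U+XXXX strings."""
    normalized = unicodedata.normalize("NFKC", text)
    return [f"U+{ord(ch):04X}" for ch in normalized]


if __name__ == "__main__":
    # U+013F and U+0140 are the Catalan L-with-middle-dot letters.
    for ch in ["\u013f", "\u0140"]:
        print(f"U+{ord(ch):04X}", "->", " ".join(nfkc_codepoints(ch)))
```

Whether the resulting two-character sequence is still a valid identifier depends on how the language profile treats U+00B7, which is the point the thread is discussing.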

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Philippe Verdy via Unicode
If you intend to allow all the standard orthography of common languages, you would also need to support apostrophes and regular hyphens in identifiers, including those from ASCII ! The Catalan middle dot is just a compact variant of the hyphen, it should have better been a diacritic, but the

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Joan Montané via Unicode
2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode < unicode@unicode.org>: > Hi, > > The Rust community is considering > <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii > identifiers, which follow UAX #31 <http://www.unicode.org/reports/tr31/>

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Hans Åberg via Unicode
rs), at low resolution or with small font sizes, > where most text is in sans-serif Latin and not slanted/italicized and not > using a handwritten style. > > If you think about writing a functional programming language using inline > formulas, then the "π" symbol may be o

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Philippe Verdy via Unicode
font sizes, where most text is in sans-serif Latin and not slanted/italicized and not using a handwritten style. If you think about writing a functional programming language using inline formulas, then the "π" symbol may be ok for the constant, and custom identifiers for a function would us

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Mark Davis ☕️ via Unicode
> The proposal also asks for identifiers to be treated as equivalent under NFKC. The guidance in #31 may not be clear. It is not to replace identifiers as typed in by the user by their NFKC equivalent. It is rather to internally *identify* two identifiers (as typed in by the user) as being
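The distinction drawn here (compare identifiers by their NFKC form, but do not rewrite what the user typed) can be sketched as a symbol table keyed on the normalized form. This is an illustrative sketch only, not how any particular compiler does it; SymbolTable and ident_key are hypothetical names.

```python
import unicodedata


def ident_key(name: str) -> str:
    """Comparison key: two identifiers are the same if their NFKC forms match."""
    return unicodedata.normalize("NFKC", name)


class SymbolTable:
    """Looks names up by NFKC key while remembering the spelling as typed."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, object]] = {}

    def declare(self, name: str, value: object) -> None:
        # Store the original spelling alongside the value; only the key is normalized.
        self._entries[ident_key(name)] = (name, value)

    def lookup(self, name: str) -> tuple[str, object]:
        return self._entries[ident_key(name)]


if __name__ == "__main__":
    table = SymbolTable()
    table.declare("\ufb01le", 1)   # spelled with U+FB01 LATIN SMALL LIGATURE FI
    print(table.lookup("file"))    # found via the NFKC key, original spelling kept
```

Keeping the typed spelling means diagnostics and tooling can show the identifier exactly as the author wrote it.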

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Hans Åberg via Unicode
> On 7 Jun 2018, at 03:56, Asmus Freytag via Unicode > wrote: > > On 6/6/2018 2:25 PM, Hans Åberg via Unicode wrote: >>> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode >>> wrote: >>> >>> The Rust community is considering adding

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Alastair Houghton via Unicode
ard to type on someone else’s keyboard; some thought needs to be given before choosing non-ASCII identifiers. Sometimes you might even choose to support multiple spellings of an API to avoid any problems. And in other cases it’s a good idea to remember that someone other than you might have to ma

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-07 Thread Richard Wordingham via Unicode
On Tue, 5 Jun 2018 01:37:47 +0100 Richard Wordingham via Unicode wrote: > The decomposed > form that looks the same is นํ้า . > The problem is that for sane results, needs > special handling. This sequence is also often untypable - part of the > protection against Thai homographs. I've been mis

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Richard Wordingham via Unicode
On Mon, 4 Jun 2018 12:49:20 -0700 Manish Goregaokar via Unicode wrote: > Hi, > > The Rust community is considering > <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii > identifiers, which follow UAX #31 > <http://www.unicode.org/reports/tr31/>

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Asmus Freytag via Unicode
On 6/6/2018 2:25 PM, Hans Åberg via Unicode wrote: On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode wrote: The Rust community is considering adding non-ascii identifiers, which follow UAX #31 (XID_Start XID_Continue*, with tweaks). The

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Hans Åberg via Unicode
> On 4 Jun 2018, at 21:49, Manish Goregaokar via Unicode > wrote: > > The Rust community is considering adding non-ascii identifiers, which follow > UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for > identifiers to be treated as equivalent under

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Henri Sivonen via Unicode
On Mon, Jun 4, 2018 at 10:49 PM, Manish Goregaokar via Unicode wrote: > The Rust community is considering adding non-ascii identifiers, which follow > UAX #31 (XID_Start XID_Continue*, with tweaks). UAX #31 is rather light on documenting its rationale. I realize that XML is a differen

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Philippe Verdy via Unicode
It could be argued that "modern" languages could use unique identifiers for their syntax or API independently of the name being rendered. The problem is that translated names may collide in non-obvious ways and become ambiguous. We've already seen the problems it caused in Excel with

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Alastair Houghton via Unicode
er something explicitly encouraged). AppleScript was also designed to be (French and Japanese syntaxes were defined), and I have an inkling that someone once told me that at least one translation had actually shipped, though the translated variants are now deprecated as far as I’m aware. Translated

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-06 Thread Alastair Houghton via Unicode
On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode wrote: > > The Rust community is considering adding non-ascii identifiers, which follow > UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for > identifiers to be treated as equivalent under NFKC. >

Requiring typed text to be NFKC (was: Can NFKC turn valid UAX 31 identifiers into non-identifiers?)

2018-06-05 Thread Manish Goregaokar via Unicode
Following up from my previous email, one of the ideas that was brought up was that if we're going to consider NFKC forms equivalent, we should require things to be typed in NFKC. I'm a bit wary of this. As Richard brought up in th

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Martin J. Dürst via Unicode
Hello Rebecca, On 2018/06/05 12:43, Rebecca T via Unicode wrote: Something I’d love to see is translated keywords; shouldn’t be hard with a line in the cargo.toml for a rudimentary lookup. Again, I’m of the opinion that an imperfect implementation is better than no attempt. I remember reading a

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Rebecca T via Unicode
I think that the benefits of inclusion from allowing non-ASCII identifiers far outweigh any corner cases this might cause. (Although ironing out and analyzing those is of course important, I don’t think they should be obstacles for implementing this kind of thing.) Something I’d love to see is

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Richard Wordingham via Unicode
On Mon, 4 Jun 2018 12:49:20 -0700 Manish Goregaokar via Unicode wrote: > Hi, > > The Rust community is considering > <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii > identifiers, which follow UAX #31 > <http://www.unicode.org/reports/tr31/>

Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Manish Goregaokar via Unicode
> adding non-ascii > identifiers, which follow UAX #31 <http://www.unicode.org/reports/tr31/> > (XID_Start XID_Continue*, with tweaks). The proposal also asks for > identifiers to be treated as equivalent under NFKC. > > Are there any cases where this will lead to inconsistencies

Can NFKC turn valid UAX 31 identifiers into non-identifiers?

2018-06-04 Thread Manish Goregaokar via Unicode
Hi, The Rust community is considering <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii identifiers, which follow UAX #31 <http://www.unicode.org/reports/tr31/> (XID_Start XID_Continue*, with tweaks). The proposal also asks for identifiers to be treated as equivalent unde

Language / Locale identifiers

2010-12-08 Thread Mark Davis ☕
For those of you interested in language and locale identifiers, the RFC for Unicode locale identifiers was just released: http://tools.ietf.org/html/rfc6067 See also http://unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers Mark

New I-D for Internationalized Resource Identifiers

2002-04-17 Thread Martin Duerst
. Based on discussions at the W3C Technical Plenary in February, and in particular on input from Larry Masinter, we have made some changes in the responsibilities for the Internationalized Resource Identifiers (IRI) draft, as follows: - The W3C I18N WG is taking on responsibility for carefully

RE: Identifiers

2001-04-16 Thread Yves Arrouye
> > We have normalization similar to > > the one you're talking about in our Internet Keywords > system. It is built on > > top of NFKC. It is good for users, but then it is also very > specific. > > Details, details! (Or do you consider that stuff a proprietary > advantage?) I don't really.

RE: Identifiers

2001-04-16 Thread Yves Arrouye
> Florian, I respectfully suggest that you look up the various technical > reports that accompany the Unicode standard. It looks like there may be > certain confusion about characters and glyphs Oops, got tripped by my native French language. I didn't mean "certain" but "some". Do not conclude tha

RE: Identifiers

2001-04-16 Thread Yves Arrouye
> There should be a method to overcome the source separation rule which > might have saved certain identical characters from unification. > > > - U+0048 LATIN CAPITAL LETTER H > > - U+0397 GREEK CAPITAL LETTER ETA > > - U+041D CYRILLIC CAPITAL LETTER EN > > - U+13BB CHEROKEE LETTER M
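That normalization does nothing about these cross-script look-alikes can be checked directly. A small sketch, assuming Python's unicodedata; the list and helper name below are only for illustration, using the four code points quoted in the snippet.

```python
import unicodedata

# The four look-alike code points listed in the quoted message.
LOOKALIKES = ["\u0048", "\u0397", "\u041d", "\u13bb"]


def survives_normalization(ch: str) -> bool:
    """True if ch is unchanged by all four Unicode normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, ch) == ch for form in forms)


if __name__ == "__main__":
    for ch in LOOKALIKES:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), survives_normalization(ch))
    # All four stay distinct even after compatibility normalization.
    print(len({unicodedata.normalize("NFKC", ch) for ch in LOOKALIKES}))
```

Detecting such confusables needs a separate mechanism; normalization forms only fold canonically or compatibility-equivalent sequences within the encoding, not visually similar letters from different scripts.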

Re: Identifiers

2001-04-16 Thread David Starner
ral Unicode problem, but you have to know > about these issues in order to design protocols which permit a large > Unicode subset in identifiers and can nevertheless be used > successfully. I don't see why it's that much of a concern if the users pick reasonable identifiers, and I do

Re: Identifiers

2001-04-16 Thread Kenneth Whistler
ers. But we determined long ago that for the purposes of computer character encoding, Latin, Greek, Cyrillic, and Cherokee are distinct scripts. Unifications are *not* applied across scripts just because letters happen to look alike in particular instances. > > I don't think it's a general

Re: Identifiers

2001-04-16 Thread Florian Weimer
, but it can get pretty complicated as soon as computers are involved. I've helped people to cope with an environment where the glyphs from ISO-8859-* were not unified, and this certainly has some hairy consequences. I don't think it's a general Unicode problem, but you have to know

Re: Identifiers

2001-04-16 Thread DougEwell2
Florian Weimer <[EMAIL PROTECTED]> wrote: > > It will always be necessary for people to think a bit when creating > > their email addresses,... > > Well, you can't expect people to know most of Unicode just to choose > an email address. :-/ and then later: > > In general, the problem is

Re: Identifiers

2001-04-16 Thread David Starner
se are not equivalent under normalization? That's a pity. Why would they be? That would break spell-checking and searching among other things. > > There are a number of spaces, and apostrophes that will get confused > > for each other. > > Hmm, perhaps it's best to

RE: Identifiers

2001-04-16 Thread Yves Arrouye
> (I don't know if email addresses will be internationalized anytime > soon. This is just an example. ;-) http://www.-i-d-n.net/ They have a normalization process that may be used for e-mail someday. It explicitly does not do anything about similar looking glyphs. Read their list archive, I'm s

RE: Identifiers

2001-04-16 Thread Yves Arrouye
> > On Sun, Apr 15, 2001 at 08:10:55PM +0200, Florian Weimer wrote: > > > Is it sufficient to mandate that all such identifiers > MUST be KC- or > > > KD-normalized? Does this guarantee print-and-enter round-trip > > > compatibility? > > > >

Re: Identifiers

2001-04-16 Thread John Cowan
Florian Weimer scripsit: > Is it sufficient to mandate that all such identifiers MUST be KC- or > KD-normalized? Does this guarantee print-and-enter round-trip > compatibility? It couldn't possibly. I could spoof an email from you, for example, by using a GREEK SMALL LETTER OMIC

Re: Identifiers

2001-04-16 Thread Florian Weimer
Martin Duerst <[EMAIL PROTECTED]> writes: > Of course, KC/KD-normalization is not sufficient. The problem > already exists in ASCII. I/l/1 and 0/O can easily be confused. Most fonts (especially typewriter-style ones often used to print email addresses) do differentiate quite clearly between thes

Re: Identifiers

2001-04-16 Thread Florian Weimer
David Starner <[EMAIL PROTECTED]> writes: > On Sun, Apr 15, 2001 at 08:10:55PM +0200, Florian Weimer wrote: > > Is it sufficient to mandate that all such identifiers MUST be KC- or > > KD-normalized? Does this guarantee print-and-enter round-trip > > compatibility?

Re: Identifiers

2001-04-16 Thread Florian Weimer
root <[EMAIL PROTECTED]> writes: > As far as I know, the various language committees have settled on > Normalization Form C for identifiers. ECMAscript (nee LiveScript, > then JavaScript) being 1 example. In contrast, C# doesn't use any normalization, at least according t

Re: Identifiers

2001-04-16 Thread Asmus Freytag
At 09:24 AM 4/16/01 +0900, Martin Duerst wrote: >NFC only eliminates things that are supposed to look exactly >the same. NFKC eliminates quite a bit more than that. NFKC eliminates some things that are quite distinct - it should not be seen as a general purpose folding mechanism. A./

Re: Identifiers

2001-04-15 Thread root
As far as I know, the various language committees have settled on Normalization Form C for identifiers. ECMAscript (nee LiveScript, then JavaScript) being 1 example. Perl will likely do the same. James. Martin Duerst wrote: ... > Of course, normalization (preferably NFC and/or NFKC, to stay

Re: Identifiers

2001-04-15 Thread Martin Duerst
Hello Florian, Of course, KC/KD-normalization is not sufficient. The problem already exists in ASCII. I/l/1 and 0/O can easily be confused. It will always be necessary for people to think a bit when creating their email addresses,... On the other hand, when identifiers can be written in various

Re: Identifiers

2001-04-15 Thread David Starner
On Sun, Apr 15, 2001 at 08:10:55PM +0200, Florian Weimer wrote: > Is it sufficient to mandate that all such identifiers MUST be KC- or > KD-normalized? Does this guarantee print-and-enter round-trip > compatibility? In general, the problem is unsolvable. There are several look-alikes

RE: Identifiers

2001-04-15 Thread Yves Arrouye
> Is it sufficient to mandate that all such identifiers MUST be KC- or > KD-normalized? Does this guarantee print-and-enter round-trip > compatibility? It depends on the accuracy of both the printer and the reader. So I'd say no. People won't necessarily make the difference b

Identifiers

2001-04-15 Thread Florian Weimer
Unicode is finally entering domains which were ASCII-only for decades. However, with some kinds of identifiers, new problems occur. Such identifiers are interpreted by humans and machines, and they have to survive printing and reentering. Furthermore, it might not be possible to check

RE: Bidi/Hebrew Identifiers

2001-03-13 Thread Cathy Wissink
, 2001 9:42 PM To: Unicode List Cc: Mati Allouche; Israel Gidali Subject: Bidi/Hebrew Identifiers Following are suggestions for the modification or elaboration of the Unicode rules regarding identifiers with respect to Hebrew. These suggestions were discussed at a technical meeting of the SII, and

Bidi/Hebrew Identifiers

2001-03-12 Thread Jonathan Rosenne
Following are suggestions for the modification or elaboration of the Unicode rules regarding identifiers with respect to Hebrew. These suggestions were discussed at a technical meeting of the SII, and comments are requested. Reference: TUS 3.0 5.16 page 133 Cantillation marks (0591 to 05AF