On Tue, Jun 21, 2016 at 1:16 PM, Joe Groff <[email protected]> wrote:
>
> > On Jun 21, 2016, at 8:47 AM, John McCall via swift-evolution <[email protected]> wrote:
> >
> >> On Jun 20, 2016, at 7:07 PM, Xiaodi Wu <[email protected]> wrote:
> >> On Mon, Jun 20, 2016 at 8:58 PM, John McCall via swift-evolution <[email protected]> wrote:
> >>> On Jun 20, 2016, at 5:22 PM, Jordan Rose via swift-evolution <[email protected]> wrote:
> >>> IIRC, some languages require zero-width joiners (though not zero-width spaces, which are distinct) to properly encode some of their characters. I'd be very leery of having Swift land on a model where identifiers can be used with some languages and not others; that smacks of ethnocentrism.
> >>
> >> None of those languages require zero-width characters between two Latin letters, or between a Latin letter and an Arabic numeral, or at the end of a word. Since standard / system APIs will (barring some radical shift) use those code points exclusively, it's justifiable to give them some special attention.
> >>
> >> Although the practical implementation may need to be more limited in scope, the general principle doesn't need to privilege Latin letters and Arabic numerals. If, in any context, the presence or absence of a zero-width glyph cannot possibly be distinguished by a human reading the text, then the compiler should also be indifferent to its presence or absence (or, alternatively, its presence should be a compile-time error).
> >
> > Sure, that's obvious. Jordan was observing that the simplest way to enforce that, banning such characters from identifiers completely, would still interfere with some languages, and I was pointing out that just doing enough to protect English would get most of the practical value because it would protect every use of the system and standard library. A program would then only become attackable in this specific way for its own identifiers using non-Latin characters.
> >
> > All that said, I'm not convinced that this is worthwhile; the identifier-similarity problem in Unicode is much broader than just invisible characters. In fact, Swift still doesn't canonicalize identifiers, so canonically equivalent compositions of the same glyph will actually produce different names. So unless we're going to fix that and then ban all sorts of things that are known to generally be represented with a confusable glyph in a typical fixed-width font (like the mathematical alphabets), this is just a problem that will always exist in some form.
>
> Any discussion about this ought to start from UAX #31, the Unicode consortium's recommendations on identifiers in programming languages:
>
> http://unicode.org/reports/tr31/
>
> Section 2.3 specifically calls out the situations in which ZWJ and ZWNJ need to be allowed. The document also describes a stability policy for handling new Unicode versions, other confusability issues, and many of the other problems with adopting Unicode in a programming language's syntax.
>
That's a fantastic document--a very edifying read. Given Swift's robust support for Unicode in its core libraries, it's kind of surprising to me that identifiers aren't canonicalized at compile time. From a quick first read, faithful adoption of UAX #31 recommendations would address most if not all of the confusability and zero-width security issues raised in this conversation.

>
> -Joe
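
To make the canonicalization point concrete, here is a minimal sketch of the gap between what the standard library's String already does and what the thread says identifiers do not. The variable names are purely illustrative, and this only shows String behavior at runtime; it does not show how the compiler actually compares identifiers.

```swift
// Two spellings of the same visible text:
// precomposed U+00E9 versus "e" followed by combining acute accent U+0301.
let precomposed = "caf\u{E9}"
let decomposed = "cafe\u{301}"

// String equality in the standard library is based on canonical equivalence,
// so these two values compare equal.
let equalAsStrings = (precomposed == decomposed)

// The underlying Unicode scalar sequences still differ, which is the kind of
// difference that would matter if names were compared scalar by scalar.
let sameScalars = Array(precomposed.unicodeScalars) == Array(decomposed.unicodeScalars)

print(equalAsStrings, sameScalars)
```

If identifiers were normalized along the lines UAX #31 recommends, two spellings like these would presumably name the same thing rather than two different things.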
