On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
> Considering that ruling out too much can be a problem later, but just
> treating anything above ASCII as opaque hasn't caused trouble (that I
> know of) for HTML other than compatibility issues with XML's stricter
> stance, why should a programming language, if it opts to support
> non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> complexity of UAX #31 instead of allowing everything above ASCII in
> identifiers? In other words, what problem does making a programming
> language conform to UAX #31 solve?

After refreshing my memory of XML history, I realize that mentioning
XML does not helpfully illustrate my question despite the mention of
XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
ignore the XML part.

Trying to rephrase my question more clearly:

Let's assume that we are designing a computer-parseable syntax where
tokens consisting of user-chosen characters can't occur next to each
other and, instead, always have some syntax-reserved characters
between them. That is, I'm talking about syntaxes that look like this
(could be e.g. Java):

  ab.cd();

Here, ab and cd are tokens with user-chosen characters whereas space
(the indent), period, parentheses and the semicolon are
syntax-reserved. We know that ab and cd are distinct tokens, because
there is a period between them, and we know the opening parenthesis
ends the cd token.

To illustrate what I'm explicitly _not_ talking about, I'm not talking
about a syntax like this:

αβ⊗γδ

Here αβ and γδ are user-named variables and ⊗ is a user-named
operator, and the distinction between different kinds of user-named
tokens has to be known somehow in order to be able to tell that there
are three distinct tokens: αβ, ⊗, and γδ.

My question is:

When designing a syntax where tokens with user-chosen characters
can't occur next to each other without some syntax-reserved characters
between them, what advantages are there from limiting the user-chosen
characters according to UAX #31 as opposed to treating any character
that is not a syntax-reserved character as a character that can occur
in user-named tokens?

I understand that taking the latter approach allows users to mint
tokens that on some aesthetic measure don't make sense (e.g. minting
tokens that consist of glyphless code points), but why is it important
to prescribe that this is prohibited as opposed to just letting users
choose not to mint tokens that are inconvenient for them to work with
given the behavior that their plain text editor gives to various
characters? That is, why is conforming to UAX #31 worth the risk of
prohibiting the use of characters that some users might want to use?

The introduction of XID after ID and the introduction of Extended
Hashtag Identifiers after XID are indicative of over-restriction
having been a problem.

Limiting user-minted tokens to UAX #31 does not appear to be necessary
for security purposes considering that HTML and CSS exist in a
particularly adversarial environment and get away with taking the
approach that any character that isn't a syntax-reserved character is
collected as part of a user-minted identifier. (Informally, both treat
non-ASCII characters the same as an ASCII underscore. HTML even treats
non-whitespace, non-U+0000 ASCII controls that way.)

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/