Re: Full Unicode based on UTF-16 proposal

Steven L. Sun, 18 Mar 2012 17:29:54 -0700

Erik Corry wrote:

Steven Levithan wrote:

Anyway, this is probably all moot, unless someone wants to officially
propose POSIX character classes for ES RegExp. ...In which case I'll be
happy to state about a half-dozen reasons to not do so. :)


Please do, they seem quite sensible to me.

My main objections are due to the POSIX character class syntax itself, and my preference for introducing Unicode categories using \p{..} instead. But to get into a little more detail...

* They're backward incompatible. /[[:name:]]/ is currently equivalent to /[\[:aemn]\]/ in web-reality. Granted, this probably won't be a big deal for existing code, but because they're not currently an error, their use could cause latent bugs in old browsers that don't support them and treat them as part of a character class's set.

* They work inside of bracket expressions only. This is clumsy and needlessly confusing. [:alnum:] outside of a bracket expression would probably have to continue to be equivalent to [:almnu], which would lead to at least occasional developer frustration and bugs.

* Since the exact characters they match differ between regex libraries (beyond just Unicode version variation), they would contribute to the existing landscape of regex features that seem to be portable but actually work slightly differently in different places. We need less of this.

* They are either rarely useful or only minor conveniences over existing shorthands, explicit character classes, or Unicode categories that could be matched using \p{..} in a more standardized fashion.

* Other implementations, at least, do not allow them to be negated on their own, unlike \p{..} (via \P{..} or \p{^..}). They can be negated by using them in negated bracket expressions, but that may negate more than you want.

* If ES ever adopts .NET/XPath-style character class subtraction or Java-style character class intersection (the latter was on the cards for ES4), their syntax would become even more confusing.

* Bonus pompous bullet point: IMO, there are more useful and important new RegExp features to focus on, including support for Unicode categories (which, IMO, are regex's new and improved version of POSIX character classes). My personal wishlist would probably include at least 20 new regex features above POSIX character classes, even if they were introduced using the \p{..} syntax (which is how Java included them).

* Bonus nitpick: The name of the syntax itself causes confusion. POSIX calls them character classes, and calls their container a bracket expression. JavaScripters already call the container a character class. (Not an objection, per se. Presumably we could call them something like "POSIX shorthands" to avoid confusion.)
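To make the first two bullets concrete, here is a small sketch of how a JavaScript engine without POSIX class support reads these patterns. The sample strings and variable names are made up for illustration; only the two patterns come from the points above.

```javascript
// Without POSIX class support, "[[:name:]" is an ordinary character class
// containing "[", ":", "n", "a", "m", "e", and the trailing "]" is a literal.
const posixLike = /[[:name:]]/;
const spelledOut = /[\[:aemn]\]/;   // the equivalent spelled out explicitly

// Hypothetical sample inputs, chosen only to exercise both patterns.
const samples = ["a]", "[]", ":]", "n]", "a", "x]", "]"];
const agree = samples.every(s => posixLike.test(s) === spelledOut.test(s));
console.log(agree);

// Outside a bracket expression, [:alnum:] is just the set ":", "a", "l", "m", "n", "u".
const bareAlnum = /^[:alnum:]$/;
console.log(bareAlnum.test("u"));   // a listed letter
console.log(bareAlnum.test("z"));   // an unlisted letter
console.log(bareAlnum.test("5"));   // a digit
```

Neither pattern throws a syntax error, which is why code written with POSIX classes in mind could fail silently in engines that don't support them.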

I'd have no actual objections to adding them using the \p{Name} syntax (as Java does), especially if there is demand for them among regex power-users (you're the first person who I've seen strongly advocate for them). However, I'd still have concerns about exactly which names are added, exactly what they match, and their compatibility with other regex flavors.

In fact \w with Unicode support seems very similar to [:alnum:] to me.
If this one is useful are there not other Unicode categories that
would be useful?

\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).
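The gap between those two sets is easy to see in an engine that supports \p{..} under the /u flag (standardized later, in ES2018, so this is an assumption about the runtime rather than something available when this was written). A rough sketch, with variable names and sample characters picked purely for illustration:

```javascript
// Assumes an engine with the /u flag and \p{..} property escapes.
const unicodeWord = /^[\p{L}\p{Nd}_]$/u;            // the set suggested for Unicode-aware \w
const alnumLike = /^[\p{Ll}\p{Lu}\p{Lt}\p{Nd}]$/u;  // the set suggested for [[:alnum:]]

// Illustrative characters, written as escapes to avoid encoding issues.
const chars = [
  "a",       // Ll, lowercase letter
  "\u00C9",  // Lu, Latin capital E with acute
  "\u0663",  // Nd, Arabic-Indic digit three
  "_",       // underscore, only in the first set
  "\u4E2D",  // Lo, a CJK ideograph
  "\u02B0",  // Lm, modifier letter small h
  "-",       // punctuation, in neither set
];

for (const ch of chars) {
  console.log(JSON.stringify(ch), unicodeWord.test(ch), alnumLike.test(ch));
}
```

The underscore and the Lo and Lm letters fall in the first set but not the second, which is the kind of mismatch users expecting \w-like behavior could trip over.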

As you said, though, Unicode categories are indeed quite useful. Unicode scripts, too. I'd advocate for them alongside you. Because of how useful they are, I've even made them usable via my XRegExp JavaScript library (see http://git.io/xregexp ). That lib has a relatively small but enthusiastic user base and is seeing increasing use in server-side JS, where the overhead of loading long Unicode code point ranges doesn't matter as much. But, so long as a /u flag is added for switching \w\b\d to Unicode-mode, I'd argue that even Unicode categories and scripts are less important than various other features I've mentioned recently on es-discuss, including named capture and atomic groups.


-- Steven Levithan


_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
