Erik Corry wrote:
> Steven Levithan wrote:
>> Anyway, this is probably all moot, unless someone wants to officially
>> propose POSIX character classes for ES RegExp. ...In which case I'll be
>> happy to state about a half-dozen reasons to not do so. :)
>
> Please do, they seem quite sensible to me.

My main objections are due to the POSIX character class syntax itself, and my preference for introducing Unicode categories using \p{..} instead. But to go into a little more detail...

* They're backward incompatible. /[[:name:]]/ is currently equivalent to /[\[:aemn]\]/ in web-reality. Granted, this probably won't be a big deal for existing code, but because they're not currently an error, their use could cause latent bugs in old browsers that don't support them and treat them as part of a character class's set.

* They work inside of bracket expressions only. This is clumsy and needlessly confusing. [:alnum:] outside of a bracket expression would probably have to continue to be equivalent to [:almnu], which would lead to at least occasional developer frustration and bugs.

* Since the exact characters they match differ between regex libraries (beyond just Unicode version variation), they would contribute to the existing landscape of regex features that seem to be portable but actually work slightly differently in different places. We need less of this.

* They are either rarely useful or only minor conveniences over existing shorthands, explicit character classes, or Unicode categories that could be matched using \p{..} in a more standardized fashion.

* Other implementations, at least, do not allow them to be negated on their own, unlike \p{..} (via \P{..} or \p{^..}). They can be negated by using them in negated bracket expressions, but that may negate more than you want.

* If ES ever adopts .NET/XPath-style character class subtraction or Java-style character class intersection (the latter was on the cards for ES4), their syntax would become even more confusing.

* Bonus pompous bullet point: IMO, there are more useful and important new RegExp features to focus on, including support for Unicode categories (which, IMO, are regex's new and improved version of POSIX character classes). My personal wishlist would probably include at least 20 new regex features above POSIX character classes, even if they were introduced using the \p{..} syntax (which is how Java included them).

* Bonus nitpick: The name of the syntax itself causes confusion. POSIX calls them character classes, and calls their container a bracket expression. JavaScripters already call the container a character class. (Not an objection, per se. Presumably we could call them something like "POSIX shorthands" to avoid confusion.)
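
The first two bullets are easy to check in a current engine. Here is a minimal sketch, assuming a runtime with no POSIX class support and regexes written without the u flag; the variable names and sample strings are only for illustration:

```javascript
// Without the u flag, [[:name:]] parses as the character class [\[:aemn]
// followed by a literal "]", not as a POSIX class.
const looksPosix = /^[[:name:]]$/;
const spelledOut = /^[\[:aemn]\]$/;

// Both patterns should agree on every sample string.
const samples = ["a]", ":]", "[]", "n]", "a", "x]", "]"];
const sameResults = samples.every(
  (s) => looksPosix.test(s) === spelledOut.test(s)
);

// Outside a bracket expression, [:alnum:] is just the set [:almnu].
const bareAlnum = /^[:alnum:]$/;
const matchesColon = bareAlnum.test(":"); // a colon is in the set
const matchesDigit = bareAlnum.test("7"); // a digit is not

console.log({ sameResults, matchesColon, matchesDigit });
```

Neither pattern throws, which is the point of the first bullet: code that uses the new syntax would run silently in an engine that lacks it, just with a different meaning.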

I'd have no actual objections to adding them using the \p{Name} syntax (as Java does), especially if there is demand for them among regex power-users (you're the first person who I've seen strongly advocate for them). However, I'd still have concerns about exactly which names are added, exactly what they match, and their compatibility with other regex flavors.

> In fact \w with Unicode support seems very similar to [:alnum:] to me.
> If this one is useful are there not other Unicode categories that
> would be useful?

\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for [[:alnum:]], for compatibility reasons, would probably be [\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive (if you like that exact set) or a negative (many users will think it's equivalent to \w with Unicode even though it isn't).
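
To make the difference concrete, here is a rough sketch that assumes an engine supporting \p{..} under the u flag (not the case in ES5-era engines). The two pattern names are made up for illustration, and the second set is only the suggested reading of [[:alnum:]] from the paragraph above, not a standardized definition:

```javascript
// Assumes Unicode property escapes (\p{..}) are available with the u flag.
const unicodeWord = /^[\p{L}\p{Nd}_]$/u; // Unicode-aware \w as described above
const posixAlnum = /^[\p{Ll}\p{Lu}\p{Lt}\p{Nd}]$/u; // suggested [[:alnum:]] set

// Both sets contain cased letters and decimal digits.
const bothMatchLetter =
  unicodeWord.test("\u00E9") && posixAlnum.test("\u00E9"); // U+00E9, category Ll
const bothMatchDigit =
  unicodeWord.test("\u0663") && posixAlnum.test("\u0663"); // U+0663, category Nd

// Only the \w-style set contains the underscore and uncased letters.
const wordOnlySamples = [
  "_", // underscore
  "\u4E2D", // U+4E2D, category Lo (CJK ideograph)
  "\u02B0", // U+02B0, category Lm (modifier letter small h)
];
const onlyWordMatches = wordOnlySamples.every(
  (ch) => unicodeWord.test(ch) && !posixAlnum.test(ch)
);

console.log({ bothMatchLetter, bothMatchDigit, onlyWordMatches });
```

The gap is the underscore plus the Lo and Lm letter categories, which covers most CJK text, so the two sets diverge on more than a handful of edge cases.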

As you said, though, Unicode categories are indeed quite useful. Unicode scripts, too. I'd advocate for them alongside you. Because of how useful they are, I've even made them usable via my XRegExp JavaScript library (see http://git.io/xregexp ). That lib has a relatively small but enthusiastic user base and is seeing increasing use in server-side JS, where the overhead of loading long Unicode code point ranges doesn't matter as much. But, so long as a /u flag is added for switching \w\b\d to Unicode-mode, I'd argue that even Unicode categories and scripts are less important than various other features I've mentioned recently on es-discuss, including named capture and atomic groups.

-- Steven Levithan


_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss
