Erik Corry wrote:
> Steven Levithan wrote:
>> Anyway, this is probably all moot, unless someone wants to officially
>> propose POSIX character classes for ES RegExp. ...In which case I'll be
>> happy to state about a half-dozen reasons to not do so. :)
> Please do, they seem quite sensible to me.
My main objections are due to the POSIX character class syntax itself, and
my preference for introducing Unicode categories using \p{..} instead. But
to go into a little more detail...
* They're backward incompatible. /[[:name:]]/ is currently equivalent to
/[\[:aemn]\]/ in web-reality. Granted, this probably won't be a big deal for
existing code, but because they're not currently an error, their use could
cause latent bugs in old browsers that don't support them and treat them as
part of a character class's set.
* They work inside of bracket expressions only. This is clumsy and
needlessly confusing. [:alnum:] outside of a bracket expression would
probably have to continue to be equivalent to [:almnu], which would lead to
at least occasional developer frustration and bugs.
* Since the exact characters they match differ between regex libraries
(beyond just Unicode version variation), they would contribute to the
existing landscape of regex features that seem to be portable but actually
work slightly differently in different places. We need less of this.
* They are either rarely useful or only minor conveniences over existing
shorthands, explicit character classes, or Unicode categories that could be
matched using \p{..} in more standardized fashion.
* Other implementations, at least, do not allow them to be negated on their
own, unlike \p{..} (via \P{..} or \p{^..}). They can be negated by using
them in negated bracket expressions, but that may negate more than you want.
* If ES ever adopts .NET/XPath-style character class subtraction or
Java-style character class intersection (the latter was on the cards for
ES4), their syntax would become even more confusing.
* Bonus pompous bullet point: IMO, there are more useful and important new
RegExp features to focus on, including support for Unicode categories
(which, IMO, are regex's new and improved version of POSIX character
classes). My personal wishlist would probably include at least 20 new regex
features above POSIX character classes, even if they were introduced using
the \p{..} syntax (which is how Java included them).
* Bonus nitpick: The name of the syntax itself causes confusion. POSIX calls
them character classes, and calls their container a bracket expression.
JavaScripters already call the container a character class. (Not an
objection, per se. Presumably we could call them something like "POSIX
shorthands" to avoid confusion.)
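To make the first two bullets concrete, here is a small sketch of how engines that lack POSIX classes read this syntax. It assumes a regex with no u or v flag (the only mode that existed at the time of writing), and the variable names are just illustrative.

```javascript
// Without flags, the inner "[" is an ordinary member of the class and the
// class ends at the first "]". The second "]" is a literal that must follow.
const posixLookalike = /^[[:name:]]$/;
const spelledOut = /^[\[:aemn]\]$/;

const samples = ["n]", "[]", ":]", "n", "x]", "name"];

// Each row holds: input, result of the POSIX-looking form, result of the
// spelled-out form. The two results agree for every sample.
const rows = samples.map((s) => [s, posixLookalike.test(s), spelledOut.test(s)]);
for (const [input, a, b] of rows) {
  console.log(JSON.stringify(input), a, b);
}

// Outside a bracket expression, [:alnum:] is a plain character class
// containing ":", "a", "l", "n", "u", and "m".
const bareAlnum = /^[:alnum:]$/;
console.log(bareAlnum.test("a"), bareAlnum.test(":"), bareAlnum.test("1"));
```

Neither pattern is a syntax error, which is why code written for an engine with POSIX classes would silently misbehave in one without them.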
I'd have no actual objections to adding them using the \p{Name} syntax (as
Java does), especially if there is demand for them among regex power-users
(you're the first person who I've seen strongly advocate for them). However,
I'd still have concerns about exactly which names are added, exactly what
they match, and their compatibility with other regex flavors.
In fact \w with Unicode support seems very similar to [:alnum:] to me.
If this one is useful are there not other Unicode categories that
would be useful?
\w with Unicode should match [\p{L}\p{Nd}_]. The best way to go for
[[:alnum:]], for compatibility reasons, would probably be
[\p{Ll}\p{Lu}\p{Lt}\p{Nd}]. This difference could be argued as a positive
(if you like that exact set) or a negative (many users will think it's
equivalent to \w with Unicode even though it isn't).
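A rough way to see the difference between the two sets, assuming an engine that supports \p{..} behind a u flag (an assumption on my part, not something ES offers as of this writing). The variable names and probe characters are only illustrative.

```javascript
// Assumed: \p{..} is available when the u flag is set.
const unicodeWord = /^[\p{L}\p{Nd}_]$/u;                  // \w with Unicode
const casedAlnum = /^[\p{Ll}\p{Lu}\p{Lt}\p{Nd}]$/u;       // proposed [[:alnum:]]

const probes = [
  "a",       // Ll, lowercase letter
  "\u00C9",  // Lu, uppercase E with acute
  "7",       // Nd, decimal digit
  "_",       // Pc, connector punctuation
  "\u4E2D",  // Lo, a CJK ideograph (letter without case)
  "\u02B0",  // Lm, modifier letter small h
];

// Characters that the Unicode \w set accepts but the cased-letter set rejects.
const onlyInWord = probes.filter(
  (ch) => unicodeWord.test(ch) && !casedAlnum.test(ch)
);
console.log(onlyInWord);
```

The underscore and every letter without case (Lo, Lm) fall into the gap, which is the kind of mismatch users expecting \w semantics would trip over.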
As you said, though, Unicode categories are indeed quite useful. Unicode
scripts, too. I'd advocate for them alongside you. Because of how useful
they are, I've even made them usable via my XRegExp JavaScript library (see
http://git.io/xregexp ). That lib has a relatively small but enthusiastic
user base and is seeing increasing use in server-side JS, where the overhead
of loading long Unicode code point ranges doesn't matter as much. But, so
long as a /u flag is added for switching \w\b\d to Unicode-mode, I'd argue
that even Unicode categories and scripts are less important than various
other features I've mentioned recently on es-discuss, including named
capture and atomic groups.
-- Steven Levithan
_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss