On 4/25/12 9:04 AM, Jukka K. Korpela wrote:
This can be really awkward when you would need advanced tools like
Unicode regular expressions (JavaScript has just Ascii regexps)
Regular expressions for Unicode in JavaScript are possible but very
awkward, because they require manipulating UTF-16 surrogate pairs by hand.
The example in the previously linked PDF states:
"You will never be able to write regexes like [𝒜-𝒵] since that gets
misinterpreted as [\uD835\uDC9C-\uD835\uDCB5]"
This regex can be rewritten as \uD835[\uDC9C-\uDCB5], which will work
just fine.
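
Here is a minimal sketch of that rewrite, assuming a regex without any
Unicode-aware mode, so the pattern is matched one UTF-16 code unit at a
time. The names scriptCapitals, samples, results and naiveError are made
up for illustration. It also tries the naive class from the quote, which
as far as I can tell is rejected outright because it contains the
reversed range \uDC9C-\uD835.

```javascript
// Fixed lead surrogate \uD835, trail surrogate ranging over DC9C..DCB5,
// i.e. code points U+1D49C through U+1D4B5.
const scriptCapitals = /\uD835[\uDC9C-\uDCB5]/;

// Illustrative inputs, written as escapes so the code units are visible.
const samples = {
  first: "\uD835\uDC9C",    // U+1D49C, start of the range
  last: "\uD835\uDCB5",     // U+1D4B5, end of the range
  pastEnd: "\uD835\uDCB6",  // U+1D4B6, one past the end
  ascii: "A",               // plain ASCII letter
};

const results = {};
for (const name of Object.keys(samples)) {
  results[name] = scriptCapitals.test(samples[name]);
}

// The naive class is parsed code unit by code unit, so it reads as
// \uD835, then the range \uDC9C-\uD835, then \uDCB5. That range is
// out of order, so constructing the pattern throws.
let naiveError = null;
try {
  new RegExp("[\uD835\uDC9C-\uD835\uDCB5]");
} catch (e) {
  naiveError = e.name;
}

console.log(results, naiveError);
```

The naive pattern is built with new RegExp inside try/catch rather than
written as a literal, since a bad literal would stop the whole script
from parsing.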
If you're searching for a range where the lead surrogate changes, you'd
need to use the pipe | to combine 2 different patterns, one per lead
surrogate. To match U+10000 through U+107FF, you'd need to combine the
following 2 ranges (U+10000-U+103FF and U+10400-U+107FF):
1) \uD800[\uDC00-\uDFFF]
2) \uD801[\uDC00-\uDFFF]
The Regular Expression would be:
(\uD800[\uDC00-\uDFFF]|\uD801[\uDC00-\uDFFF])
It's ugly and roundabout, but it does work.
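
A rough way to check the combined pattern is to build test strings from
code points. This is only a sketch: toSurrogates and matchesRange are
hypothetical helpers, not anything built in, and they assume the input
is a supplementary code point (U+10000 or above).

```javascript
// Hypothetical helper: split a supplementary code point (>= U+10000)
// into its UTF-16 lead and trail surrogates.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  const lead = 0xD800 + (offset >> 10);     // top 10 bits of the offset
  const trail = 0xDC00 + (offset & 0x3FF);  // low 10 bits of the offset
  return [lead, trail];
}

// One alternative per lead surrogate, covering U+10000..U+107FF.
const range = /(\uD800[\uDC00-\uDFFF]|\uD801[\uDC00-\uDFFF])/;

// Hypothetical helper: does the pattern match this one code point?
function matchesRange(codePoint) {
  const [lead, trail] = toSurrogates(codePoint);
  return range.test(String.fromCharCode(lead, trail));
}

console.log(matchesRange(0x10000)); // first code point under \uD800
console.log(matchesRange(0x103FF)); // last code point under \uD800
console.log(matchesRange(0x10400)); // first code point under \uD801
console.log(matchesRange(0x107FF)); // last code point under \uD801
console.log(matchesRange(0x10800)); // lead surrogate is \uD802
```

Each lead surrogate covers a block of 0x400 code points, which is why a
range that crosses a multiple of 0x400 needs one alternative per lead.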
Just saying,
-Steve