On 4/25/12 9:04 AM, Jukka K. Korpela wrote:
This can be really awkward when you would need advanced tools like Unicode regular expressions (JavaScript has just Ascii regexps)

Regular expressions for Unicode in JavaScript are possible but very awkward, because they require manipulating UTF-16 surrogate pairs.

The example in the previously linked PDF states:
"You will never be able to write regexes like [𝒜-𝒵] since that gets misinterpreted as [\uD835\uDC9C-\uD835\uDCB5]"

This regex can be rewritten as \uD835[\uDC9C-\uDCB5], which will work just fine.
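
As a quick check, here is a minimal sketch of that rewritten pattern in use. The variable names are mine, chosen for illustration; the test strings are the surrogate pairs for U+1D49C, U+1D4B5 and (as a non-match) U+1D400.

```javascript
// Fixed lead surrogate \uD835 followed by a range of trail surrogates,
// covering U+1D49C through U+1D4B5.
const scriptCapital = /\uD835[\uDC9C-\uDCB5]/;

// Illustrative test strings, written as explicit surrogate pairs.
const first = "\uD835\uDC9C";   // U+1D49C, start of the range
const last = "\uD835\uDCB5";    // U+1D4B5, end of the range
const outside = "\uD835\uDC00"; // U+1D400, same lead surrogate but below the range

console.log(scriptCapital.test(first));   // true
console.log(scriptCapital.test(last));    // true
console.log(scriptCapital.test(outside)); // false
console.log(scriptCapital.test("A"));     // false
```

The trick only works because both ends of the range share the same lead surrogate, so the character class only has to vary the trail surrogate.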

If you're searching for a range where the lead surrogate changes, you need to use the pipe | to combine one pattern per lead surrogate. To match U+10000 through U+107FF, you'd combine the following 2 ranges:
1) \uD800[\uDC00-\uDFFF] (covers U+10000 - U+103FF)
2) \uD801[\uDC00-\uDFFF] (covers U+10400 - U+107FF)

The Regular Expression would be:
(\uD800[\uDC00-\uDFFF]|\uD801[\uDC00-\uDFFF])
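
A rough sketch to confirm the boundaries of that pattern. The helper toSurrogatePair is hypothetical, written here only to build test strings from code points; it is not a built-in function.

```javascript
// The combined pattern from above: one alternative per lead surrogate.
const range = /(\uD800[\uDC00-\uDFFF]|\uD801[\uDC00-\uDFFF])/;

// Hypothetical helper (not a built-in): converts a supplementary code point
// (U+10000 and above) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const lead = 0xD800 + (offset >> 10);    // top 10 bits
  const trail = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return String.fromCharCode(lead, trail);
}

console.log(range.test(toSurrogatePair(0x10000))); // true  (\uD800\uDC00)
console.log(range.test(toSurrogatePair(0x103FF))); // true  (\uD800\uDFFF)
console.log(range.test(toSurrogatePair(0x10400))); // true  (\uD801\uDC00)
console.log(range.test(toSurrogatePair(0x107FF))); // true  (\uD801\uDFFF)
console.log(range.test(toSurrogatePair(0x10800))); // false (\uD802\uDC00)
console.log(range.test("\uFFFF"));                 // false (not a surrogate pair)
```

Each lead surrogate covers a block of 0x400 code points, so a wider range means one more alternative per additional lead surrogate.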

It's ugly and roundabout, but it does work.

Just saying,
-Steve
