On 26 Oct 2013, at 14:39, Bjoern Hoehrmann <derhoe...@gmx.net> wrote:

> * Norbert Lindenberg wrote:
>> On Oct 25, 2013, at 18:35, Jason Orendorff <jason.orendo...@gmail.com> wrote:
>>
>>> UTF-16 is designed so that you can search based on code units
>>> alone, without computing boundaries. RegExp searches fall in this
>>> category.
>>
>> Not if the RegExp is case insensitive, or uses a character class, or ".", or a
>> quantifier - these all require looking at code points rather than UTF-16 code
>> units in order to support the full Unicode character set.
>
> If you have a regular expression over an alphabet like "Unicode scalar
> values" it is easy to turn it into an equivalent regular expression over
> an alphabet like "UTF-16 code units".

FWIW, [Regenerate](http://mths.be/regenerate) is a JavaScript library that can be used for this. A few examples from <http://mathiasbynens.be/notes/javascript-unicode#regex>:

> Here’s how to create a regular expression that matches any Unicode scalar value:
>
>     >> regenerate()
>          .addRange(0x0, 0x10FFFF)     // all Unicode code points
>          .removeRange(0xD800, 0xDBFF) // minus high surrogates
>          .removeRange(0xDC00, 0xDFFF) // minus low surrogates
>          .toRegExp()
>     /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/

Similarly, to polyfill `.` in a Unicode-enabled ES6 regex:

> When the `u` flag is set, `.` is equivalent to the following
> backwards-compatible regular expression pattern:
>
>     >> regenerate()
>          .addRange(0x0, 0x10FFFF) // all Unicode code points
>          .remove( // minus `LineTerminator`s (http://ecma-international.org/ecma-262/5.1/#sec-7.3):
>            0x000A, // Line Feed <LF>
>            0x000D, // Carriage Return <CR>
>            0x2028, // Line Separator <LS>
>            0x2029  // Paragraph Separator <PS>
>          )
>          .toString();
>     '[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF]'
>
>     >> /foo(?:[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])bar/u.test('foo💩bar')
>     true
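To see the two patterns quoted above working on plain code units, here is a minimal sketch that runs without the `u` flag and without Regenerate itself. The patterns are copied from the output shown above; the helper name `splitScalarValues` is made up for illustration, and the astral symbol is written as the escaped surrogate pair `\uD83D\uDCA9` (U+1F4A9) so the snippet does not depend on file encoding.

```javascript
// Pattern copied from the Regenerate output above: one BMP scalar value,
// or a high surrogate followed by a low surrogate.
const scalarValue = /[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/g;

// Hypothetical helper (not part of Regenerate): split a string into the
// substrings matched by the pattern above. Lone surrogates are skipped,
// because the pattern only matches scalar values.
function splitScalarValues(str) {
  return str.match(scalarValue) || [];
}

const input = 'foo\uD83D\uDCA9bar'; // "foo", U+1F4A9, "bar"
const parts = splitScalarValues(input);

console.log(input.length); // 8 UTF-16 code units
console.log(parts.length); // 7 scalar values
console.log(parts[3] === '\uD83D\uDCA9'); // true: the pair stays together

// Without the `u` flag, a plain `.` consumes a single code unit, so it
// cannot span the surrogate pair between "foo" and "bar".
const plainDot = /foo.bar/;

// The backwards-compatible expansion of `.` quoted above, used without `u`.
const polyfilledDot = /foo(?:[\0-\x09\x0B\x0C\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])bar/;

console.log(plainDot.test(input));      // false
console.log(polyfilledDot.test(input)); // true
```

The polyfilled pattern matches the pair as one unit, which is the behaviour a `u`-flagged `.` is meant to have. This is only an illustration of the idea, not a replacement for generating the patterns with the library.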