Re: Working with grapheme clusters

2013-10-27 Thread Mathias Bynens
On 26 Oct 2013, at 14:39, Bjoern Hoehrmann wrote:
> * Norbert Lindenberg wrote:
>> On Oct 25, 2013, at 18:35, Jason Orendorff wrote:
>>
>>> UTF-16 is designed so that you can search based on code units
>>> alone, without computing boundaries. RegExp searches fall in this
>>> category.
>>

Re: Working with grapheme clusters

2013-10-26 Thread Norbert Lindenberg
On Oct 26, 2013, at 6:58, Jason Orendorff wrote:
> On Fri, Oct 25, 2013 at 11:42 PM, Norbert Lindenberg wrote:
>>
>> On Oct 25, 2013, at 18:35, Jason Orendorff wrote:
>>
>>> UTF-16 is designed so that you can search based on code units
>>> alone, without computing boundaries. RegExp

Re: Working with grapheme clusters

2013-10-26 Thread Norbert Lindenberg
On Oct 26, 2013, at 5:39, Bjoern Hoehrmann wrote:
> * Norbert Lindenberg wrote:
>> On Oct 25, 2013, at 18:35, Jason Orendorff wrote:
>>
>>> UTF-16 is designed so that you can search based on code units
>>> alone, without computing boundaries. RegExp searches fall in this
>>> category.
>>

Re: Working with grapheme clusters

2013-10-26 Thread Bjoern Hoehrmann
* Claude Pache wrote:
>You might know that the following ES expressions are broken:
>
>    text.charAt(0) // get the first character of the text
>    text.length > 100 ? text.substring(0,100) + '...' : text // cut the text after 100 characters
>
>The reason is *not* because ES works with U

Re: Working with grapheme clusters

2013-10-26 Thread Jason Orendorff
On Fri, Oct 25, 2013 at 11:42 PM, Norbert Lindenberg wrote:
>
> On Oct 25, 2013, at 18:35, Jason Orendorff wrote:
>
>> UTF-16 is designed so that you can search based on code units
>> alone, without computing boundaries. RegExp searches fall in this
>> category.
>
> Not if the RegExp is case ins

Re: Working with grapheme clusters

2013-10-26 Thread Bjoern Hoehrmann
* Norbert Lindenberg wrote:
>On Oct 25, 2013, at 18:35, Jason Orendorff wrote:
>
>> UTF-16 is designed so that you can search based on code units
>> alone, without computing boundaries. RegExp searches fall in this
>> category.
>
>Not if the RegExp is case insensitive, or uses a character class,

Re: Working with grapheme clusters

2013-10-25 Thread Norbert Lindenberg
On Oct 25, 2013, at 18:35, Jason Orendorff wrote:
> UTF-16 is designed so that you can search based on code units
> alone, without computing boundaries. RegExp searches fall in this
> category.

Not if the RegExp is case insensitive, or uses a character class, or ".", or a quantifier - these a
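
The point about ".", character classes, and quantifiers can be illustrated with a supplementary character. This is only a sketch: it assumes an engine that supports the ES2015 `u` flag, which is not mentioned in the message above, and the variable names are made up for the example.

```javascript
// U+1F4A9 is one code point but two UTF-16 code units (a surrogate pair).
const pile = '\uD83D\uDCA9';

// Without the `u` flag, "." matches a single code unit, so it cannot
// span the whole surrogate pair.
const dotPlain = /^.$/.test(pile);
const dotUnicode = /^.$/u.test(pile);

// Without `u`, a character class holds the two surrogates as separate members.
const classPlain = /^[\uD83D\uDCA9]$/.test(pile);
const classUnicode = /^[\uD83D\uDCA9]$/u.test(pile);

// Without `u`, a quantifier applies only to the preceding code unit
// (here the low surrogate), not to the whole character.
const quantPlain = /^\uD83D\uDCA9{2}$/.test(pile + pile);
const quantUnicode = /^\uD83D\uDCA9{2}$/u.test(pile + pile);

console.log({ dotPlain, dotUnicode, classPlain, classUnicode, quantPlain, quantUnicode });
```

In each pair the plain pattern fails and the `u` pattern succeeds, because only the latter treats the surrogate pair as one unit.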

Re: Working with grapheme clusters

2013-10-25 Thread Norbert Lindenberg
On Oct 24, 2013, at 7:38, Anne van Kesteren wrote:
> On Thu, Oct 24, 2013 at 3:31 PM, Mathias Bynens wrote:
>> Imagine you’re writing a JavaScript library that escapes a given string as
>> an HTML character reference, or as a CSS identifier, or anything else. In
>> those cases, you don’t car

Re: Working with grapheme clusters

2013-10-25 Thread Norbert Lindenberg
The internationalization working group is planning to support grapheme clusters through its text segmentation API - the strawman:
http://wiki.ecmascript.org/doku.php?id=globalization:text_segmentation

Note that Unicode Standard Annex #29 allows for tailored (language sensitive) grapheme clusters
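
As a rough illustration of what grapheme-aware truncation could look like, the sketch below uses `Intl.Segmenter`, the segmentation API that was later standardized in ECMA-402. It is not the strawman API linked above, and it assumes a runtime that ships `Intl.Segmenter`. The helper name `truncateGraphemes` is hypothetical.

```javascript
// Hypothetical helper: keep at most `max` grapheme clusters of `text`,
// appending '...' only when something was actually cut off.
// Assumes Intl.Segmenter is available in the runtime.
function truncateGraphemes(text, max) {
  // `undefined` selects the default locale; a locale tag could be passed
  // instead where language-sensitive segmentation is wanted.
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let out = '';
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === max) {
      return out + '...';
    }
    out += segment;
    count += 1;
  }
  return text;
}

// "e" + U+0301 (combining acute accent) stays together as one cluster.
console.log(truncateGraphemes('e\u0301te\u0301', 2));
```

Whether a given implementation applies any tailoring for the locale it is given is left open here.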

Re: Working with grapheme clusters

2013-10-25 Thread Jason Orendorff
On Thu, Oct 24, 2013 at 7:38 AM, Anne van Kesteren wrote:
> On Thu, Oct 24, 2013 at 3:31 PM, Mathias Bynens wrote:
>> Imagine you’re writing a JavaScript library that escapes a given string as
>> an HTML character reference, or as a CSS identifier, or anything else. In
>> those cases, you don’t

Re: Working with grapheme clusters

2013-10-24 Thread Bjoern Hoehrmann
* Mathias Bynens wrote:
>Out of curiosity, is there any programming language that operates on
>grapheme clusters (rather than code points) by default? FWIW, code point
>iteration is what I’d expect in any language.

It is the specified default for Perl 6 that can be modified through lexically sco

Re: Working with grapheme clusters

2013-10-24 Thread Claude Pache
On 24 Oct 2013, at 16:24, Mathias Bynens wrote:
>
>> text.graphemeAt(0) // get the first grapheme of the text
>>
>> // shorten a text to its first hundred graphemes
>> var shortenText = ''
>> let numGraphemes = 0
>> for (let grapheme of text) {
>>     numGra

Re: Working with grapheme clusters

2013-10-24 Thread Anne van Kesteren
On Thu, Oct 24, 2013 at 3:31 PM, Mathias Bynens wrote:
> Imagine you’re writing a JavaScript library that escapes a given string as an
> HTML character reference, or as a CSS identifier, or anything else. In those
> cases, you don’t care about grapheme clusters, you care about code points,
> ca

Re: Working with grapheme clusters

2013-10-24 Thread Mathias Bynens
On 24 Oct 2013, at 16:22, Anne van Kesteren wrote:
> On Thu, Oct 24, 2013 at 3:02 PM, Claude Pache wrote:
>> As a side note, I ask whether the `String.prototype.symbolAt`/`String.prototype.at` as proposed in a recent thread,
>> and the `String.prototype[@@iterator]` as currently specified,

Re: Working with grapheme clusters

2013-10-24 Thread Mathias Bynens
On 24 Oct 2013, at 16:02, Claude Pache wrote:
> Therefore, I propose the following basic operations to operate on grapheme
> clusters:

Out of curiosity, is there any programming language that operates on grapheme clusters (rather than code points) by default? FWIW, code point iteration is wha

Re: Working with grapheme clusters

2013-10-24 Thread Anne van Kesteren
On Thu, Oct 24, 2013 at 3:02 PM, Claude Pache wrote:
> As a side note, I ask whether the `String.prototype.symbolAt`/`String.prototype.at` as proposed in a recent thread,
> and the `String.prototype[@@iterator]` as currently specified, are really
> what people need,
> or if they would mistake

Working with grapheme clusters

2013-10-24 Thread Claude Pache
Hello,

You might know that the following ES expressions are broken:

    text.charAt(0) // get the first character of the text
    text.length > 100 ? text.substring(0,100) + '...' : text // cut the text after 100 characters

The reason is *not* because ES works with UTF-16 code units ins
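
One way the two expressions can misbehave is sketched below, using a letter followed by a combining mark. The sample string and variable names are made up for illustration and are not taken from the original message.

```javascript
// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT:
// one user-perceived character, but two code units (and two code points).
const text = 'e\u0301te\u0301'; // displays as "été"

// charAt(0) returns only the base letter and silently drops the accent.
const first = text.charAt(0);

// substring() can cut between a base letter and its combining mark,
// leaving the accent behind.
const cut = text.substring(0, 1) + '...';

// length counts code units, not what a reader sees as characters.
const units = text.length;

console.log(first, cut, units);
```

Here the string has three user-perceived characters but `length` reports 5, and both `charAt(0)` and `substring(0, 1)` yield a bare "e". Iterating by code point would not help either, since the accent is a separate code point.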