Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-07 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Oct 07, 2014 at 08:28:49AM +0200, Jacob Carlborg via Digitalmars-d-learn wrote:
> On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
>
> >This looks wrong to me. Are you sure this finds *all* possible
> >graphemes?
>
> No, the data I gave was to detect a complete code unit. Gra

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread Jacob Carlborg via Digitalmars-d-learn
On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
> This looks wrong to me. Are you sure this finds *all* possible
> graphemes?
No, the data I gave was to detect a complete code unit. Graphemes are something else, I think Uranuz is mixing up the Unicode terms.
--
/Jacob Carlborg

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread anonymous via Digitalmars-d-learn
On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote:
> ( str[index] & 0b10000000 ) == 0 ||
> ( str[index] & 0b11100000 ) == 0b11000000 ||
> ( str[index] & 0b11110000 ) == 0b11100000 ||
> ( str[index] & 0b11111000 ) == 0b11110000
>
> If it is true it means that first byte of sequence found and I can cou
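For readers following the bit masks quoted above, here is a minimal sketch of that check in D, assuming the input is valid UTF-8. The names `isSequenceStart` and `countCodePoints` are made up for this illustration and are not from the thread. Note that it counts code points, not graphemes.

```d
import std.stdio : writeln;

/// True if `c` can be the first byte of a UTF-8 sequence.
/// Hypothetical helper mirroring the four patterns quoted above.
bool isSequenceStart(char c) pure nothrow @nogc @safe
{
    return (c & 0b1000_0000) == 0             // 0xxxxxxx: ASCII, one byte
        || (c & 0b1110_0000) == 0b1100_0000   // 110xxxxx: starts a 2-byte sequence
        || (c & 0b1111_0000) == 0b1110_0000   // 1110xxxx: starts a 3-byte sequence
        || (c & 0b1111_1000) == 0b1111_0000;  // 11110xxx: starts a 4-byte sequence
}

/// Counts code points by counting sequence-start bytes.
/// Assumes `str` is valid UTF-8; no decoding or validation is done.
size_t countCodePoints(const(char)[] str) pure nothrow @nogc @safe
{
    size_t n = 0;
    foreach (char c; str) // iterates code units, not decoded characters
        if (isSequenceStart(c))
            ++n;
    return n;
}

void main()
{
    writeln(countCodePoints("abc"));      // 3 bytes, 3 code points
    writeln(countCodePoints("при"));      // 6 bytes, 3 code points
    writeln(countCodePoints("e\u0301"));  // 'e' + combining accent: 2 code points
}
```

For valid UTF-8 the same result could be had by testing that a byte is not a continuation byte, i.e. `(c & 0b1100_0000) != 0b1000_0000`.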

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread H. S. Teoh via Digitalmars-d-learn
On Mon, Oct 06, 2014 at 05:28:43PM +0000, Uranuz via Digitalmars-d-learn wrote:
> >
> >Have a look here [1]. For example, if you have a byte that is between
> >U+0080 and U+07FF you know that you need two bytes to get that whole
> >code point.
> >
> >[1] http://en.wikipedia.org/wiki/UTF-8#Descripti

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread ketmar via Digitalmars-d-learn
On Mon, 06 Oct 2014 17:28:43 +0000 Uranuz via Digitalmars-d-learn wrote:
> If it is true it means that first byte of sequence found and I
> can count them. Am I right that it equals to number of graphemes,
> or are there some exceptions from this rule?
a lot. take for example RIGHT-TO-LEFT MARK,

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread Uranuz via Digitalmars-d-learn
> Have a look here [1]. For example, if you have a byte that is between U+0080 and U+07FF you know that you need two bytes to get that whole code point.
>
> [1] http://en.wikipedia.org/wiki/UTF-8#Description
Thanks. I already solved it myself for UTF-8 encoding. There I chose an approach using

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread Kagamin via Digitalmars-d-learn
On Sunday, 5 October 2014 at 12:09:34 UTC, Uranuz wrote:
> Maybe there is some idea how to just detect first code unit of grapheme without overhead for using Grapheme struct? I just tried to check if ch < 128 (for UTF-8). But this doesn't work. How to check if byte is continuation of code for single

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread Nicolas F. via Digitalmars-d-learn
Unicode is hard to deal with properly, as how you deal with it is very context-dependent. One grapheme is a visible character and consists of one or more codepoints. One codepoint is one mapping of a byte sequence to a meaning, and consists of one or more bytes. This you do not want to deal with
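To make the three levels concrete, here is a small sketch in D that counts bytes, code points and graphemes for the same string. The wrapper function names are invented for this example. `walkLength` and `byGrapheme` are the Phobos functions from `std.range` and `std.uni`. The code-point count relies on Phobos decoding narrow strings when they are walked as ranges.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// Number of UTF-8 code units (bytes) in `s`.
size_t countCodeUnits(string s) { return s.length; }

/// Number of code points: walking a narrow string as a range decodes it.
size_t countCodePoints(string s) { return s.walkLength; }

/// Number of graphemes (user-perceived characters).
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    writeln(countCodeUnits(s));   // 3: one byte for 'e', two for U+0301
    writeln(countCodePoints(s));  // 2
    writeln(countGraphemes(s));   // 1
}
```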

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-05 Thread Jacob Carlborg via Digitalmars-d-learn
On 2014-10-05 14:09, Uranuz wrote:
> Maybe there is some idea how to just detect first code unit of grapheme without overhead for using Grapheme struct? I just tried to check if ch < 128 (for UTF-8). But this doesn't work. How to check if byte is continuation of code for single code point or if new s

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-05 Thread Uranuz via Digitalmars-d-learn
You can use std.uni.byGrapheme to iterate by graphemes: http://dlang.org/phobos/std_uni.html#.byGrapheme AFAIK, graphemes are not "self synchronizing", but codepoints are. You can pop code units until you reach the beginning of a new codepoint. From there, you can iterate by graphemes, though
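The "pop code units until you reach the beginning of a new codepoint" idea could look roughly like this for UTF-8. This is only a sketch: `skipToCodePointStart` is a hypothetical helper, not a Phobos function, and it assumes the data is otherwise valid UTF-8.

```d
import std.stdio : writeln;

/// Drops leading UTF-8 continuation bytes (10xxxxxx) so that the
/// returned slice begins at the start of a code point (or is empty).
/// Hypothetical helper for illustration only.
const(char)[] skipToCodePointStart(const(char)[] s) pure nothrow @nogc @safe
{
    while (s.length && (s[0] & 0b1100_0000) == 0b1000_0000)
        s = s[1 .. $];
    return s;
}

void main()
{
    string s = "日本";      // each character takes 3 bytes in UTF-8
    auto cut = s[1 .. $];   // slice that starts in the middle of the first character
    writeln(skipToCodePointStart(cut)); // 本
}
```

This only resynchronizes to a code point boundary. Since graphemes are not self-synchronizing, the slice may still begin in the middle of a grapheme, for example on a combining mark.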

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-05 Thread monarch_dodra via Digitalmars-d-learn
On Sunday, 5 October 2014 at 08:27:58 UTC, Uranuz wrote:
> I have struct StringStream that I use to go through and parse input string. String could be of string, wstring or dstring type. I implement function popChar that reads codeUnit from Stream. I want to have *debug* mode of parser (via CT sw

How to detect start of Unicode symbol and count amount of graphemes

2014-10-05 Thread Uranuz via Digitalmars-d-learn
I have struct StringStream that I use to go through and parse input string. String could be of string, wstring or dstring type. I implement function popChar that reads codeUnit from Stream. I want to have *debug* mode of parser (via CT switch), where I could get information about lineIndex, cod