Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-07 Thread Jacob Carlborg via Digitalmars-d-learn
On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote: This looks wrong to me. Are you sure this finds *all* possible graphemes? No, the data I gave was to detect a complete code unit. Graphemes are something else, I think Uranuz is mixing up the Unicode terms. -- /Jacob Carlborg

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-07 Thread H. S. Teoh via Digitalmars-d-learn
On Tue, Oct 07, 2014 at 08:28:49AM +0200, Jacob Carlborg via Digitalmars-d-learn wrote: On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote: This looks wrong to me. Are you sure this finds *all* possible graphemes? No, the data I gave was to detect a complete code unit. Graphemes

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread Nicolas F. via Digitalmars-d-learn
Unicode is hard to deal with properly because how you deal with it is very context dependent. One grapheme is a visible character and consists of one or more codepoints. One codepoint is one mapping of a byte sequence to a meaning, and consists of one or more bytes. This you do not want to deal with
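
To make the three levels concrete, here is a small sketch (not taken from the thread) that measures the same string in code units, code points and graphemes. It assumes a UTF-8 `string` and relies on Phobos `std.range.walkLength` and `std.uni.byGrapheme`; the sample text is only an illustration.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    writeln(s.length);                // 3: UTF-8 code units (bytes)
    writeln(s.walkLength);            // 2: code points (strings decode when walked as ranges)
    writeln(s.byGrapheme.walkLength); // 1: graphemes
}
```

The three numbers differ for the same text, which is why it matters to decide up front which unit a position or a count refers to.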

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread Kagamin via Digitalmars-d-learn
On Sunday, 5 October 2014 at 12:09:34 UTC, Uranuz wrote: Maybe there is some idea how to just detect first code unit of grapheme without overhead for using Grapheme struct? I just tried to check if ch < 128 (for UTF-8). But this doesn't work. How to check if byte is continuation of code for single
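
A possible answer to the continuation-byte question, as a sketch: in UTF-8 every continuation byte has the bit pattern 10xxxxxx, so masking the top two bits is enough. Comparing against 128 only separates ASCII from non-ASCII and cannot tell a lead byte from a continuation byte. The helper name `isContinuation` is hypothetical, not a Phobos function.

```d
/// True if `c` is a UTF-8 continuation byte (bit pattern 10xxxxxx).
/// Hypothetical helper for illustration; not part of Phobos.
bool isContinuation(char c) pure nothrow @nogc @safe
{
    return (c & 0b1100_0000) == 0b1000_0000;
}

void main()
{
    string s = "\u00E9";            // U+00E9 is encoded as the two bytes C3 A9
    assert(!isContinuation('a'));   // ASCII: a complete one-byte sequence
    assert(!isContinuation(s[0]));  // 0xC3: lead byte of a two-byte sequence
    assert(isContinuation(s[1]));   // 0xA9: continuation byte
}
```

Note that this finds the start of a code point, not the start of a grapheme; a grapheme may span several code points.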

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread Uranuz via Digitalmars-d-learn
Have a look here [1]. For example, if you have a byte that is between U+0080 and U+07FF you know that you need two bytes to get that whole code point. [1] http://en.wikipedia.org/wiki/UTF-8#Description Thanks. I solved it myself already for UTF-8 encoding. There I chose an approach with
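
The table on the linked Wikipedia page can be turned into code roughly as follows. This is a sketch under assumptions: the ranges U+0080 to U+07FF and so on describe code points, while the test itself looks at the bit pattern of the lead byte. The name `sequenceLength` is hypothetical; `std.utf.stride` is the existing Phobos function that does the same job with validation.

```d
import std.utf : stride;

/// Length in bytes of the UTF-8 sequence that starts with `lead`,
/// or 0 if `lead` is a continuation byte or matches no lead-byte pattern.
/// Hypothetical helper; it does not reject overlong or out-of-range lead
/// bytes such as 0xC0, 0xC1 or 0xF5 and above.
size_t sequenceLength(char lead) pure nothrow @nogc @safe
{
    if ((lead & 0b1000_0000) == 0)           return 1; // 0xxxxxxx: U+0000..U+007F
    if ((lead & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx: U+0080..U+07FF
    if ((lead & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx: U+0800..U+FFFF
    if ((lead & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx: U+10000..U+10FFFF
    return 0;
}

void main()
{
    string s = "a\u00E9\u20AC\U0001F600"; // 1-, 2-, 3- and 4-byte sequences
    size_t i = 0;
    while (i < s.length)
    {
        immutable n = sequenceLength(s[i]);
        assert(n == stride(s, i)); // Phobos agrees on well-formed input
        i += n;
    }
}
```

For real input, `std.utf.stride` or `std.utf.decode` is probably the safer choice, since both report malformed sequences instead of silently returning a length.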

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread ketmar via Digitalmars-d-learn
On Mon, 06 Oct 2014 17:28:43 + Uranuz via Digitalmars-d-learn digitalmars-d-learn@puremagic.com wrote: If it is true it means that first byte of sequence found and I can count them. Am I right that it equals to number of graphemes, or are there some exceptions from this rule? a lot. take

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread H. S. Teoh via Digitalmars-d-learn
On Mon, Oct 06, 2014 at 05:28:43PM +, Uranuz via Digitalmars-d-learn wrote: Have a look here [1]. For example, if you have a byte that is between U+0080 and U+07FF you know that you need two bytes to get that whole code point. [1] http://en.wikipedia.org/wiki/UTF-8#Description

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-06 Thread anonymous via Digitalmars-d-learn
On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote: ( str[index] & 0b1000_0000 ) == 0 || ( str[index] & 0b1110_0000 ) == 0b1100_0000 || ( str[index] & 0b1111_0000 ) == 0b1110_0000 || ( str[index] & 0b1111_1000 ) == 0b1111_0000 If it is true it means that first byte of sequence found and I can count
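
Counting the bytes that start a sequence can be sketched like this. For well-formed UTF-8 the four lead-byte tests are equivalent to "not a continuation byte", so the sketch uses the single shorter test. The function name `countCodePoints` is hypothetical, and the result is a count of code points, which is not in general the number of graphemes.

```d
import std.range : walkLength;

/// Counts code points in well-formed UTF-8 by counting every byte that is
/// not a continuation byte (10xxxxxx). Hypothetical helper; no validation.
size_t countCodePoints(const(char)[] str) pure nothrow @nogc @safe
{
    size_t n = 0;
    foreach (char c; str) // iterates code units, no decoding
        if ((c & 0b1100_0000) != 0b1000_0000)
            ++n;
    return n;
}

void main()
{
    string s = "na\u00EFve \u20AC"; // 7 code points in 10 bytes
    assert(s.length == 10);
    assert(countCodePoints(s) == 7);
    assert(countCodePoints(s) == s.walkLength); // same as decoding code points
}
```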

How to detect start of Unicode symbol and count amount of graphemes

2014-10-05 Thread Uranuz via Digitalmars-d-learn
I have struct StringStream that I use to go through and parse input string. String could be of string, wstring or dstring type. I implement function popChar that reads codeUnit from Stream. I want to have *debug* mode of parser (via CT switch), where I could get information about lineIndex,

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-05 Thread monarch_dodra via Digitalmars-d-learn
On Sunday, 5 October 2014 at 08:27:58 UTC, Uranuz wrote: I have struct StringStream that I use to go through and parse input string. String could be of string, wstring or dstring type. I implement function popChar that reads codeUnit from Stream. I want to have *debug* mode of parser (via CT

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-05 Thread Uranuz via Digitalmars-d-learn
You can use std.uni.byGrapheme to iterate by graphemes: http://dlang.org/phobos/std_uni.html#.byGrapheme AFAIK, graphemes are not self synchronizing, but codepoints are. You can pop code units until you reach the beginning of a new codepoint. From there, you can iterate by graphemes, though
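
The resynchronization idea can be sketched as follows: drop code units until the slice starts on a code point boundary, then hand the rest to `std.uni.byGrapheme`. The helper `skipToCodePointStart` is a hypothetical name, and the sketch assumes well-formed UTF-8.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Drops leading continuation bytes so the slice starts on a code point
/// boundary. Hypothetical helper for illustration; not part of Phobos.
inout(char)[] skipToCodePointStart(inout(char)[] s) pure nothrow @nogc @safe
{
    while (s.length && (s[0] & 0b1100_0000) == 0b1000_0000)
        s = s[1 .. $];
    return s;
}

void main()
{
    // U+00E9, then "e" + U+0301 COMBINING ACUTE ACCENT, then "x"
    string text = "\u00E9e\u0301x";

    // text[1 .. $] begins in the middle of the two-byte U+00E9 sequence.
    auto tail = skipToCodePointStart(text[1 .. $]);
    assert(tail == "e\u0301x");
    assert(tail.byGrapheme.walkLength == 2); // "e" + accent, then "x"
}
```

Because graphemes are not self-synchronizing, a slice that happens to start on a combining mark would still yield a partial first grapheme; only the code point boundary is recovered here.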

Re: How to detect start of Unicode symbol and count amount of graphemes

2014-10-05 Thread Jacob Carlborg via Digitalmars-d-learn
On 2014-10-05 14:09, Uranuz wrote: Maybe there is some idea how to just detect first code unit of grapheme without overhead for using Grapheme struct? I just tried to check if ch < 128 (for UTF-8). But this doesn't work. How to check if byte is continuation of code for single code point or if new