On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
This looks wrong to me. Are you sure this finds *all* possible
graphemes?
No, the data I gave was to detect a complete code unit. Graphemes are
something else, I think Uranuz is mixing up the Unicode terms.
--
/Jacob Carlborg
On Tue, Oct 07, 2014 at 08:28:49AM +0200, Jacob Carlborg via
Digitalmars-d-learn wrote:
On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
This looks wrong to me. Are you sure this finds *all* possible
graphemes?
No, the data I gave was to detect a complete code unit. Graphemes
Unicode is hard to deal with properly as how you deal with it is
very context dependent.
One grapheme is a visible character and consists of one or more
codepoints. One codepoint is one mapping of a byte sequence to a
meaning, and consists of one or more bytes.
This you do not want to deal with
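To make the three terms above concrete, here is a minimal D sketch that measures one string three ways. The `Lengths` struct and the `measure` function are illustrative names, not part of Phobos; `std.utf.count` and `std.uni.byGrapheme` are the library calls assumed to do the real work.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : count;

/// Hypothetical helper type: the three "lengths" of a UTF-8 string.
struct Lengths
{
    size_t codeUnits;  // bytes, for a UTF-8 string
    size_t codePoints; // decoded Unicode scalar values
    size_t graphemes;  // user-perceived characters
}

/// Hypothetical helper: measures `s` in code units, code points and graphemes.
Lengths measure(string s)
{
    return Lengths(s.length, count(s), s.byGrapheme.walkLength);
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character,
    // two code points, three UTF-8 bytes.
    writeln(measure("e\u0301"));
}
```

The three numbers only agree for plain ASCII text, which is why counting one of them is not a substitute for counting another.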
On Sunday, 5 October 2014 at 12:09:34 UTC, Uranuz wrote:
Maybe there is some idea how to just detect first code unit of
grapheme without overhead for using Grapheme struct? I just
tried to check if ch 128 (for UTF-8). But this doesn't work. How
to check if byte is continuation of code for single
Have a look here [1]. For example, if you have a code point that is
between U+0080 and U+07FF you know that you need two bytes to
encode that whole code point.
[1] http://en.wikipedia.org/wiki/UTF-8#Description
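The table on the linked page maps code point ranges to sequence lengths: U+0000 to U+007F take one byte, U+0080 to U+07FF two, U+0800 to U+FFFF three, and U+10000 to U+10FFFF four. As a rough sketch, `std.utf.stride` reports that length from the lead byte at a given index; `sequenceLengths` is a made-up helper name for illustration, and the input is assumed to be valid UTF-8.

```d
import std.stdio : writeln;
import std.utf : stride;

/// Hypothetical helper: UTF-8 code units used by each code point, in order.
size_t[] sequenceLengths(string s)
{
    size_t[] result;
    for (size_t i = 0; i < s.length; )
    {
        // stride may throw a UTFException if s[i] is not a valid lead byte.
        immutable n = stride(s, i);
        result ~= n;
        i += n; // jump to the next lead byte
    }
    return result;
}

void main()
{
    // U+0041 (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
    writeln(sequenceLengths("A\u00E9\u20AC\U0001F600"));
}
```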
Thanks. I already solved it myself for the UTF-8 encoding. I
chose the approach with
On Mon, 06 Oct 2014 17:28:43 +0000
Uranuz via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com>
wrote:
If it is true it means that first byte of sequence found and I
can count them. Am I right that it equals to number of graphemes,
or are there some exceptions from this rule?
a lot. take
On Mon, Oct 06, 2014 at 05:28:43PM +0000, Uranuz via Digitalmars-d-learn wrote:
Have a look here [1]. For example, if you have a code point that is between
U+0080 and U+07FF you know that you need two bytes to encode that whole
code point.
[1] http://en.wikipedia.org/wiki/UTF-8#Description
On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote:
( str[index] & 0b10000000 ) == 0 ||
( str[index] & 0b11100000 ) == 0b11000000 ||
( str[index] & 0b11110000 ) == 0b11100000 ||
( str[index] & 0b11111000 ) == 0b11110000
If it is true it means that first byte of sequence found and I
can count
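The check quoted above looks like the usual UTF-8 lead-byte test. The sketch below writes the four masks out in full and counts lead bytes; `isLeadByte` and `countLeadBytes` are illustrative names, and the input is assumed to be valid UTF-8. Note that the result is the number of code points, which can be larger than the number of graphemes.

```d
import std.stdio : writeln;
import std.utf : count;

/// True if `b` can start a UTF-8 sequence (ASCII, or a 2-, 3- or 4-byte lead).
bool isLeadByte(char b) pure nothrow @nogc @safe
{
    return (b & 0b1000_0000) == 0             // 0xxxxxxx: ASCII
        || (b & 0b1110_0000) == 0b1100_0000   // 110xxxxx: 2-byte lead
        || (b & 0b1111_0000) == 0b1110_0000   // 1110xxxx: 3-byte lead
        || (b & 0b1111_1000) == 0b1111_0000;  // 11110xxx: 4-byte lead
}

/// Counts code points by counting lead bytes; assumes valid UTF-8.
size_t countLeadBytes(const(char)[] s) pure nothrow @nogc @safe
{
    size_t n;
    foreach (char b; s) // iterates code units, no decoding
        if (isLeadByte(b))
            ++n;
    return n;
}

void main()
{
    string s = "e\u0301"; // one grapheme, two code points, three bytes
    writeln(countLeadBytes(s), " ", count(s));
}
```

For valid UTF-8 the same predicate can be written as `(b & 0b1100_0000) != 0b1000_0000`, that is, "not a continuation byte".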
I have struct StringStream that I use to go through and parse
input string. String could be of string, wstring or dstring type.
I implement function popChar that reads codeUnit from Stream. I
want to have *debug* mode of parser (via CT switch), where I
could get information about lineIndex,
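One possible shape for such a stream, as a sketch only: the struct below is not the poster's code, and the `debugMode` template flag and the `columnIndex` field are assumptions made for illustration. Unlike the description above, this `popChar` decodes a whole code point with `std.utf.decode` instead of returning a single code unit, so the same body works for `string`, `wstring` and `dstring`.

```d
import std.stdio : writeln;
import std.utf : decode;

/// Hypothetical stream over any of string, wstring or dstring.
struct StringStream(S, bool debugMode = false)
{
    S source;
    size_t index; // position in code units

    static if (debugMode)
    {
        size_t lineIndex;   // zero-based line number
        size_t columnIndex; // code points since the last '\n'
    }

    @property bool empty() const { return index >= source.length; }

    /// Decodes and consumes one code point; throws UTFException on bad input.
    dchar popChar()
    {
        immutable dchar c = decode(source, index); // advances index
        static if (debugMode)
        {
            if (c == '\n') { ++lineIndex; columnIndex = 0; }
            else ++columnIndex;
        }
        return c;
    }
}

void main()
{
    auto stream = StringStream!(wstring, true)("a\nb\u00E9"w);
    while (!stream.empty)
        stream.popChar();
    writeln(stream.lineIndex, ":", stream.columnIndex);
}
```

With `debugMode` set to `false` the position fields are compiled out entirely, so the release build pays nothing for them.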
On Sunday, 5 October 2014 at 08:27:58 UTC, Uranuz wrote:
I have struct StringStream that I use to go through and parse
input string. String could be of string, wstring or dstring
type. I implement function popChar that reads codeUnit from
Stream. I want to have *debug* mode of parser (via CT
You can use std.uni.byGrapheme to iterate by graphemes:
http://dlang.org/phobos/std_uni.html#.byGrapheme
AFAIK, graphemes are not self synchronizing, but codepoints
are. You can pop code units until you reach the beginning of a
new codepoint. From there, you can iterate by graphemes, though
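A minimal sketch of that resynchronisation for UTF-8, assuming the only damage is a slice that starts in the middle of a sequence; `skipToCodePoint` is a made-up helper name. As the message warns, the resulting position is a code point boundary but not necessarily a grapheme boundary (it could land on a combining mark).

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// Drops leading UTF-8 continuation bytes (10xxxxxx) so that the slice
/// starts at a code point boundary.
inout(char)[] skipToCodePoint(inout(char)[] s) pure nothrow @nogc @safe
{
    while (s.length && (s[0] & 0b1100_0000) == 0b1000_0000)
        s = s[1 .. $];
    return s;
}

void main()
{
    string text = "\u00E9e\u0301";             // bytes: C3 A9 65 CC 81
    auto tail = skipToCodePoint(text[1 .. $]); // slice began mid-sequence at A9
    writeln(tail.byGrapheme.walkLength);       // graphemes in the resynchronised tail
}
```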
On 2014-10-05 14:09, Uranuz wrote:
Maybe there is some idea how to just detect first code unit of grapheme
without overhead for using Grapheme struct? I just tried to check if ch
128 (for UTF-8). But this doesn't work. How to check if byte is
continuation of code for single code point or if new
12 matches