On Tue, Oct 07, 2014 at 08:28:49AM +0200, Jacob Carlborg via
Digitalmars-d-learn wrote:
> On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
>
> >This looks wrong to me. Are you sure this finds *all* possible
> >graphemes?
>
> No, the data I gave was to detect a complete code unit. Gra
On 06/10/14 19:48, H. S. Teoh via Digitalmars-d-learn wrote:
This looks wrong to me. Are you sure this finds *all* possible
graphemes?
No, the data I gave was to detect a complete code unit. Graphemes are
something else, I think Uranuz is mixing up the Unicode terms.
--
/Jacob Carlborg
On Monday, 6 October 2014 at 17:28:45 UTC, Uranuz wrote:
( str[index] & 0b10000000 ) == 0 ||
( str[index] & 0b11100000 ) == 0b11000000 ||
( str[index] & 0b11110000 ) == 0b11100000 ||
( str[index] & 0b11111000 ) == 0b11110000
If it is true, it means that the first byte of a sequence is found and I
can cou
On Mon, Oct 06, 2014 at 05:28:43PM +0000, Uranuz via Digitalmars-d-learn wrote:
> >
> >Have a look here [1]. For example, if you have a byte that is between
> >U+0080 and U+07FF you know that you need two bytes to get that whole
> >code point.
> >
> >[1] http://en.wikipedia.org/wiki/UTF-8#Descripti
On Mon, 06 Oct 2014 17:28:43 +0000
Uranuz via Digitalmars-d-learn
wrote:
> If it is true, it means that the first byte of a sequence is found and I
> can count them. Am I right that it equals the number of graphemes,
> or are there some exceptions to this rule?
A lot. Take, for example, RIGHT-TO-LEFT MARK,
Have a look here [1]. For example, if you have a byte that is
between U+0080 and U+07FF you know that you need two bytes to
get that whole code point.
[1] http://en.wikipedia.org/wiki/UTF-8#Description
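
A minimal sketch of the byte-level test described on that page, assuming
the input is valid UTF-8. Instead of matching each lead-byte pattern, it
checks the complement: every byte that is not a 10xxxxxx continuation
byte starts a new code point. The helpers isLeadByte and countCodePoints
are hypothetical names used only for this illustration; codeLength and
stride are from std.utf.

```d
import std.utf : codeLength, stride;

/// True if `c` starts a UTF-8 sequence, i.e. it is not a
/// continuation byte of the form 10xxxxxx.
bool isLeadByte(char c) pure nothrow @nogc @safe
{
    return (c & 0b1100_0000) != 0b1000_0000;
}

/// Counts code points in valid UTF-8 by counting lead bytes.
/// Hypothetical helper; it counts code points, not graphemes.
size_t countCodePoints(const(char)[] s) pure nothrow @nogc @safe
{
    size_t n = 0;
    foreach (char c; s) // iterating as char visits code units, no decoding
    {
        if (isLeadByte(c))
            ++n;
    }
    return n;
}

void main()
{
    // U+0416 lies in U+0080..U+07FF, so it needs two UTF-8 code units.
    assert(codeLength!char('\u0416') == 2);

    string s = "a\u0416b";
    assert(s.length == 4);           // 1 + 2 + 1 code units
    assert(stride(s, 1) == 2);       // the sequence starting at index 1 is 2 units long
    assert(countCodePoints(s) == 3); // three code points
}
```
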
Thanks. I have already solved it myself for the UTF-8 encoding. There I
chose the approach of using
On Sunday, 5 October 2014 at 12:09:34 UTC, Uranuz wrote:
Maybe there is some idea how to just detect first code unit of
grapheme without overhead for using Grapheme struct? I just
tried to check if ch < 128 (for UTF-8). But this doesn't work. How
to check if byte is continuation of code for single
Unicode is hard to deal with properly, as how you deal with it is
very context-dependent.
One grapheme is a visible character and consists of one or more
codepoints. One codepoint is one mapping of a byte sequence to a
meaning, and consists of one or more bytes.
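
To make the three levels concrete, here is a small sketch that counts
each of them for the same text. The sample string is my own choice, not
taken from the thread; walkLength is from std.range and byGrapheme from
std.uni.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one visible character made of two code points.
    string s = "e\u0301";

    assert(s.length == 3);                // UTF-8 code units (bytes): 1 + 2
    assert(s.walkLength == 2);            // code points (a string decodes to dchar as a range)
    assert(s.byGrapheme.walkLength == 1); // graphemes

    // The code unit count depends on the encoding; the other two do not.
    assert("e\u0301"w.length == 2);       // UTF-16 code units
    assert("e\u0301"d.length == 2);       // UTF-32 code units
}
```
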
This you do not want to deal with
On 2014-10-05 14:09, Uranuz wrote:
Maybe there is some idea how to just detect first code unit of grapheme
without overhead for using Grapheme struct? I just tried to check if ch
< 128 (for UTF-8). But this doesn't work. How to check if byte is
continuation of code for single code point or if new s
You can use std.uni.byGrapheme to iterate by graphemes:
http://dlang.org/phobos/std_uni.html#.byGrapheme
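
A rough sketch of how byGrapheme could feed the kind of position
tracking asked about earlier in the thread: it records the code unit
offset at which each grapheme starts. graphemeStarts is a hypothetical
helper, not part of Phobos, and it assumes valid UTF-8 input.

```d
import std.uni : byGrapheme;
import std.utf : codeLength;

/// Returns the UTF-8 code unit offset at which each grapheme of `s` starts.
/// Hypothetical helper for illustration only.
size_t[] graphemeStarts(string s)
{
    size_t[] starts;
    size_t offset = 0;
    foreach (g; s.byGrapheme)
    {
        starts ~= offset;
        // Advance by the encoded size of every code point in this grapheme.
        foreach (i; 0 .. g.length)
            offset += codeLength!char(g[i]);
    }
    return starts;
}

void main()
{
    // "e" + combining acute accent forms one grapheme of 3 code units,
    // so the following "x" starts at offset 3.
    size_t[] expected = [0, 3];
    assert(graphemeStarts("e\u0301x") == expected);
}
```

This still goes through the Grapheme struct, so it does not avoid that
overhead; it only shows how the offsets line up.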
AFAIK, graphemes are not "self synchronizing", but codepoints
are. You can pop code units until you reach the beginning of a
new codepoint. From there, you can iterate by graphemes, though
On Sunday, 5 October 2014 at 08:27:58 UTC, Uranuz wrote:
I have struct StringStream that I use to go through and parse
input string. String could be of string, wstring or dstring
type. I implement function popChar that reads codeUnit from
Stream. I want to have *debug* mode of parser (via CT sw
I have struct StringStream that I use to go through and parse
input string. String could be of string, wstring or dstring type.
I implement function popChar that reads codeUnit from Stream. I
want to have *debug* mode of parser (via CT switch), where I
could get information about lineIndex, cod
12 matches