From: 琉璃井 <pharaoh...@163.com>
Sent: 2011-04-04 11:00
Subject: Re: [Vala] how can I get the number of unicode points in a string?
To: vala-list@gnome.org

On 2011/4/3 21:30, Adam Dingle wrote:
> On 04/03/2011 06:08 AM, 琉璃井 wrote:
>> From: "琉璃井"<pharaoh...@163.com>
>> Date: 2011-04-03 18:15:12
>> To: "Luca Bruno"<lethalma...@gmail.com>
>> Subject: Re:Re: [Vala] how can I get the number of unicode points in
>> a string?
>>
>> At 2011-04-03 16:06:32,"Luca Bruno"<lethalma...@gmail.com> wrote:
>>
>>> On Sun, Apr 03, 2011 at 03:59:23PM +0800, 琉璃井 wrote:
>>>> I see that since 0.11.0 vala string.length returns number of bytes
>>>> rather than that of unicode characters, and string[i] returns only
>>>> one byte. I wonder how to deal with east Asian character strings.
>>> There are other methods in string that deal with utf8. For example
>>> char_count() and next_char().
>>>
>> thank you.
>> I find char_count(), get_char() and next_char() in gtk+ document.
>> Looks like these methods are not covered in vala tutorial and document.
>> Is there something like string[i] for index access to utf8? I didn't
>> get it in docs.
>
> To get the i-th character, you could do this:
>
>     str.get_char(str.index_of_nth_char(i));
>
> But the current string methods are designed for iteration by offsets,
> not characters. So you should *not* do this, which will be inefficient:
>
>     for (int i = 0 ; i< str.char_count() ; ++i) // don't do this
>         str.get_char(str.index_of_nth_char(i));
>
> Instead, you want to iterate over the string using get_char() and
> next_char(). This is slightly inconvenient since these functions use
> pointers rather than integer offsets. In Vala trunk, Jürg has just
> committed a new method string.get_next_char() which will make it
> easier to iterate over strings:
>
>     // in class string
>     public bool get_next_char (ref int index, out unichar c);
>
> That isn't in any Vala release yet, though. (In the meantime, you
> might be able to copy and paste his implementation from glib-2.0.vapi
> in Vala trunk.)
>
> adam

I know get_char() and next_char() are meant to reduce iteration overhead, but there may be a more convenient way to access a UTF-8 string efficiently. After all, fetching a single byte from a string by offset is not very reasonable, because people seldom need one byte out of a multi-byte character.

Is it possible to design the string like this:

    class string {
        private unichar* buffer;
        private int* offset_array;
        ... ...
        public unichar operator [](const int i) {
            int offset = offset_array[i];
            return buffer[offset];
        }
    }

offset_array stores the byte offset of each UTF-8 character, keyed by character index. It is initialized in the constructor or somewhere similar. Then we can use string[index] with no iteration overhead.
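
For illustration, the offset-table idea above could be sketched in Vala as a small wrapper class rather than as a change to string itself. This is only a sketch under assumptions: IndexedString, its offsets field and its char_length property are hypothetical names that do not exist in glib-2.0.vapi. It also relies on string.get_next_char(), which according to the quoted message is only in Vala trunk at this point, and it assumes the input is valid UTF-8.

```vala
// Hypothetical wrapper (not part of glib-2.0.vapi): it records the byte
// offset of every character once, so that later lookups by character
// index need no scanning.
public class IndexedString {
    private string text;
    private int[] offsets;   // offsets[k] = byte offset of the k-th character

    public IndexedString (string s) {
        text = s;
        offsets = new int[s.char_count ()];

        int pos = 0;   // byte offset of the next character to decode
        unichar c;
        for (int k = 0; k < offsets.length; k++) {
            offsets[k] = pos;
            // Advances pos past one UTF-8 encoded character.
            text.get_next_char (ref pos, out c);
        }
    }

    // Number of Unicode characters, not bytes.
    public int char_length {
        get { return offsets.length; }
    }

    // A method named get() lets callers write s[i].
    public unichar get (int i) {
        return text.get_char (offsets[i]);
    }
}

void main () {
    var s = new IndexedString ("a中文b");
    stdout.printf ("%d characters\n", s.char_length);
    stdout.printf ("second character: %s\n", s[1].to_string ());
}
```

The trade-off is that building the table takes one pass over the whole string and stores one extra int per character, so it only pays off when the same string is indexed many times.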