Re: Unicode in C

2012-03-13 Thread Nadav Har'El
On Mon, Mar 12, 2012, Omer Zak wrote about Re: Unicode in C: It depends upon your tradeoffs. ... 2. Otherwise, specify two such APIs - one is UTF-8 based, one is fixed size wide character based. Create two binary variants of the libhspell ... This is why I asked this question in the first

Re: Unicode in C

2012-03-13 Thread Nadav Har'El
On Tue, Mar 13, 2012, kobi zamir wrote about Re: Unicode in C: imho because hspell only use hebrew, it can internally continue to use hebrew only charset without nikud iso-8859-8 (or with nikud win-1255). I agree, and this has been my feeling all along. By using iso-8859-8 internally

Re: Unicode in C

2012-03-13 Thread kobi zamir
So I guess that you're also in the UTF-8 camp. yes, but my opinion about utf-8 is just my opinion. i like python and python defaults to utf-8. gtk likes unicode and utf-8: http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html qt likes more options:

Re: Unicode in C

2012-03-13 Thread Ely Levy
I don't think that input/output matters so much. In something like hspell, I/O should be modular so that encodings can be added later on. After all, it already has a function to translate to/from the internal representation. I believe that iso-8859-8 and utf8 should be good enough for a start. Ely 2012/3/13

Re: Unicode in C

2012-03-13 Thread Elazar Leibovich
2012/3/13 kobi zamir kobi.za...@gmail.com So I guess that you're also in the UTF-8 camp. yes, but my opinion about utf-8 is just my opinion. i like python and python defaults to utf-8. Python's internal representation is not UTF-8, but UTF-16, or UTF-32, depends on build parameters. Thus

Re: Unicode in C

2012-03-13 Thread Meir Kriheli
Hi, 2012/3/13 Elazar Leibovich elaz...@gmail.com 2012/3/13 kobi zamir kobi.za...@gmail.com So I guess that you're also in the UTF-8 camp. yes, but my opinion about utf-8 is just my opinion. i like python and python defaults to utf-8. Python's internal representation is not UTF-8, but

Re: Unicode in C

2012-03-13 Thread Elazar Leibovich
On Tue, Mar 13, 2012 at 1:19 PM, Meir Kriheli mkrih...@gmail.com wrote: Nitpick: It's actually ucs2/ucs4 (which preceded the above but are compatible). Double nitpick, UTF-16 and UCS-2 are identical representation, and it's better to always use the name UTF-16 as the FAQ

Re: Unicode in C

2012-03-13 Thread Dan Kenigsberg
On Mon, Mar 12, 2012 at 03:05:56PM +0200, Nadav Har'El wrote: Hi, I have a question that I was sort of sad that I couldn't readily find the answer to... Let's say I want to create a C API (a C library), with functions which take strings as arguments. What am I supposed to use if I want these

Re: Unicode in C

2012-03-13 Thread Nadav Har'El
On Tue, Mar 13, 2012, Dan Kenigsberg wrote about Re: Unicode in C: In my opinion, it is nice to fit to modern standards of your major target environment (read: utf8), but not necessary to cater to all encodings. It appears that the consensus on this list is that UTF-8 is indeed the right way

Re: Unicode in C

2012-03-13 Thread Elazar Leibovich
On Tue, Mar 13, 2012 at 5:22 PM, Nadav Har'El n...@math.technion.ac.il wrote: Qt appears to use UTF-16 internally. What major free software C libraries actually prefer UTF-8? Are you talking about the internal representation, or the external interface? The internal representation is in many

Re: Unicode in C

2012-03-13 Thread Elazar Leibovich
Something very important that one needs to consider is Unicode normalization. That is, how to strip out the Niqud, and to substitute, say, KAF WITH DAGESH (U+FB3B) with just a KAF (U+05DB), etc. I guess that you're doing that already to some degree in hspell, so (in case you're translating to
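
A minimal sketch of the kind of folding described above, assuming the text has already been decoded to an array of code points. The function names are hypothetical and are not hspell's API. The presentation-form table is deliberately partial (only a few "WITH DAGESH" letters), and this is a hand-rolled fold rather than one of the standard Unicode normalization forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true for Hebrew points (niqud) such as sheva,
 * qamats, dagesh, and the shin/sin dots. Cantillation marks are not covered. */
static int is_hebrew_point(uint32_t cp)
{
    return (cp >= 0x05B0 && cp <= 0x05BD) || cp == 0x05BF ||
           cp == 0x05C1 || cp == 0x05C2 || cp == 0x05C7;
}

/* Hypothetical helper: map a few precomposed presentation forms to the
 * plain letter. Partial table, for illustration only. */
static uint32_t fold_presentation_form(uint32_t cp)
{
    switch (cp) {
    case 0xFB31: return 0x05D1; /* BET WITH DAGESH -> BET */
    case 0xFB3B: return 0x05DB; /* KAF WITH DAGESH -> KAF */
    case 0xFB44: return 0x05E4; /* PE WITH DAGESH  -> PE  */
    case 0xFB4A: return 0x05EA; /* TAV WITH DAGESH -> TAV */
    default:     return cp;
    }
}

/* Fold buf in place: drop niqud, replace presentation forms.
 * Returns the new length in code points. */
size_t fold_hebrew(uint32_t *buf, size_t len)
{
    size_t out = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (is_hebrew_point(buf[i]))
            continue;
        buf[out++] = fold_presentation_form(buf[i]);
    }
    return out;
}
```

Folding in place works because the output is never longer than the input.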

Re: Unicode in C

2012-03-13 Thread Nadav Har'El
On Tue, Mar 13, 2012, Elazar Leibovich wrote about Re: Unicode in C: Something very important that one needs to consider is Unicode normalization. That is, how to strip out the Niqud, and to substitute, say, KAF WITH DAGESH (U+FB3B) with just a KAF (U+05DB), etc. Is this really important? Does

Re: Unicode in C

2012-03-13 Thread Elazar Leibovich
On Tue, Mar 13, 2012 at 10:16 PM, Nadav Har'El n...@math.technion.ac.il wrote: On Tue, Mar 13, 2012, Elazar Leibovich wrote about Re: Unicode in C: Something very important that one needs to consider is Unicode normalization. That is, how to strip out the Niqud, and to substitute, say, KAF

Re: Unicode in C

2012-03-13 Thread Daniel Shahaf
Nadav Har'El wrote on Tue, Mar 13, 2012 at 22:16:23 +0200: On Tue, Mar 13, 2012, Elazar Leibovich wrote about Re: Unicode in C: Something very important that one needs to consider is Unicode normalization. That is, how to strip out the Niqud, and to substitute, say, KAF WITH DAGESH (U+FB3B

Re: Unicode in C

2012-03-13 Thread kobi zamir
imho: hspell does hebrew spelling well. we have iconv, glib, qt ... for doing encoding conversions well. http://en.wikipedia.org/wiki/Unix_philosophy#McIlroy:_A_Quarter_Century_of_Unix on the other side, it will be very nice to have a utf-8 interface to hspell :-)

Re: Unicode in C

2012-03-12 Thread Omer Zak
It depends upon your tradeoffs. If you use mostly Western fonts (Latin, Hebrew, etc.) and want to economize on memory use, use UTF-8. However, for Chinese, it costs more memory than it saves. If you need to use Far Eastern fonts and/or have random access for your text, use fixed size wide
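
To make the memory tradeoff concrete, here is a small sketch that computes how many bytes one code point occupies in UTF-8 and in UTF-16. The function names are made up for illustration. A Hebrew letter takes 2 bytes in both encodings, while a common Chinese character takes 3 bytes in UTF-8 but 2 in UTF-16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8. */
size_t utf8_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1; /* ASCII */
    if (cp < 0x800)   return 2; /* includes the Hebrew block */
    if (cp < 0x10000) return 3; /* rest of the BMP, including most CJK */
    return 4;                   /* supplementary planes */
}

/* Bytes needed to encode one code point in UTF-16. */
size_t utf16_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4; /* 4 means a surrogate pair */
}
```

For example, utf8_bytes(0x05D0) and utf16_bytes(0x05D0) are both 2 for ALEF, while U+4E2D costs 3 bytes in UTF-8 and 2 in UTF-16.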

Re: Unicode in C

2012-03-12 Thread Elazar Leibovich
On Mon, Mar 12, 2012 at 3:20 PM, Omer Zak w...@zak.co.il wrote: If you need to use Far Eastern fonts and/or have random access for your text, use fixed size wide character encoding (16 bit or 32 bit size). Note that UTF-16 doesn't really offer random access, due to surrogate pairs (not all
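
A short sketch of why surrogate pairs break random access: to find the n-th code point in a UTF-16 buffer you have to walk it from the start, because some code points occupy two 16-bit units. The function name is hypothetical, and unpaired surrogates are simply counted as one code point each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index (in 16-bit units) where the n-th code point starts,
 * or len if the buffer holds fewer than n+1 code points. */
size_t utf16_index_of_codepoint(const uint16_t *s, size_t len, size_t n)
{
    size_t i = 0;
    assert(s != NULL || len == 0);
    while (i < len) {
        if (n == 0)
            return i;
        /* A high surrogate followed by a low surrogate is one code point. */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i += 2;
        else
            i += 1;
        n--;
    }
    return len;
}
```

With UTF-32 the same lookup is just s[n], which is the random access property discussed in this thread.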

Re: Unicode in C

2012-03-12 Thread Elazar Leibovich
The simplest option is to accept a StringPiece-like structure (pointer to buffer + size) and an encoding, then to convert the data internally to your encoding (say, ISO-8859-8, replacing illegal characters with whitespace), and convert the output back. Do you mind using an iconv-like library? On
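
A rough sketch of this approach for the UTF-8 case only, written by hand instead of using iconv so that it stays self-contained. The struct and function names are hypothetical. Only ASCII and the Hebrew letters U+05D0..U+05EA are mapped; anything else, including malformed input, becomes a single space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical StringPiece-like view: pointer to buffer + size. */
struct str_piece {
    const char *data;
    size_t len;
};

/* Convert UTF-8 input to ISO-8859-8, writing at most out_cap bytes.
 * Returns the number of bytes written. */
size_t utf8_to_iso8859_8(struct str_piece in, unsigned char *out, size_t out_cap)
{
    size_t i = 0, o = 0;
    assert(in.data != NULL || in.len == 0);
    assert(out != NULL || out_cap == 0);
    while (i < in.len && o < out_cap) {
        unsigned char b = (unsigned char)in.data[i++];
        if (b < 0x80) {          /* ASCII is identical in both encodings */
            out[o++] = b;
            continue;
        }
        /* Continuation bytes implied by the lead byte (0 for a stray one). */
        size_t need = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : (b >= 0xC0) ? 1 : 0;
        uint32_t cp = b & (0x3F >> need);
        size_t got = 0;
        while (got < need && i < in.len &&
               ((unsigned char)in.data[i] & 0xC0) == 0x80) {
            cp = (cp << 6) | ((unsigned char)in.data[i++] & 0x3F);
            got++;
        }
        if (need == 1 && got == 1 && cp >= 0x05D0 && cp <= 0x05EA)
            out[o++] = (unsigned char)(cp - 0x05D0 + 0xE0); /* ALEF..TAV */
        else
            out[o++] = ' ';      /* unmappable or malformed sequence */
    }
    return o;
}
```

The output never needs more bytes than the input, so a buffer of in.len bytes is always large enough.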

Re: Unicode in C

2012-03-12 Thread Dov Grobgeld
My suggestion is to go with the glib/gtk approach and use utf-8 everywhere and have the API accept char*, i.e. there is no typedef for Unicode character strings. If this is not acceptable because of speed (this is its only tradeoff), then use UCS-4 internally and provide two external interfaces for

Re: Unicode in C

2012-03-12 Thread Ely Levy
What's the advantage of using ucs-4 internally? Especially if the program needs to save memory (embedded devices are pretty common these days). Ely 2012/3/12 Dov Grobgeld dov.grobg...@gmail.com My suggestion is go the glib/gtk approach and use utf-8 everywhere and have the API accept char*,

Re: Unicode in C

2012-03-12 Thread Elazar Leibovich
On Mon, Mar 12, 2012 at 5:39 PM, E L elyl...@cs.huji.ac.il wrote: What's the advantage of using ucs-4 internally? Especially if the program needs to save memory (embedded devices are pretty common these days). UTF-32, or UCS-4, is the only encoding form that allows random access to each

Re: Unicode in C

2012-03-12 Thread Nadav Har'El
On Mon, Mar 12, 2012, Elazar Leibovich wrote about Re: Unicode in C: The simplest option is to accept a StringPiece-like structure (pointer to buffer + size) and an encoding, then to convert the data internally to your encoding (say, ISO-8859-8, replacing illegal characters with whitespace

Re: Unicode in C

2012-03-12 Thread Elazar Leibovich
On Mon, Mar 12, 2012 at 7:37 PM, Nadav Har'El n...@math.technion.ac.il wrote: On Mon, Mar 12, 2012, Elazar Leibovich wrote about Re: Unicode in C: The simplest option is to accept a StringPiece-like structure (pointer to buffer + size) and an encoding, then to convert the data internally to your

Re: Unicode in C

2012-03-12 Thread kobi zamir
...@math.technion.ac.il wrote: On Mon, Mar 12, 2012, Elazar Leibovich wrote about Re: Unicode in C: The simplest option is to accept a StringPiece-like structure (pointer to buffer + size) and an encoding, then to convert the data internally to your encoding (say, ISO-8859-8, replacing illegal characters