Re: Unicode in C
On Mon, Mar 12, 2012, Omer Zak wrote about "Re: Unicode in C":
> It depends upon your tradeoffs. ... 2. Otherwise, specify two such APIs - one is UTF-8 based, one is fixed size wide character based. Create two binary variants of the libhspell ...

This is why I asked this question in the first place - I'm aware of the tradeoffs, and of the possibility to create two variants of every function (or three, if you include the existing ISO-8859-8 API). I was just wondering - could it be that 20 years (!) after UTF-8 was invented for use in Plan 9 to counter wide characters, neither method has won?

As I see it, UTF-8 vs. wide characters (or UTF-16, or UTF-32, which are all similar for my needs) is a big-endian/little-endian kind of issue: it's possible to list all sorts of advantages for each one, but in the end each choice is good and has a large number of followers, and a choice simply has to be made. Continuing to use all of these approaches is not a good thing, as I see it. Even if in practice I can write all these APIs with not too much effort, I think it's ugly.

-- Nadav Har'El | Tuesday, Mar 13 2012, n...@math.technion.ac.il | Phone +972-523-790466, ICQ 13349191 | http://nadav.harel.org.il | "Live as if you were to die tomorrow, learn as if you were to live forever."

___ Linux-il mailing list Linux-il@cs.huji.ac.il http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
Re: Unicode in C
On Tue, Mar 13, 2012, kobi zamir wrote about "Re: Unicode in C":
> imho because hspell only use hebrew, it can internally continue to use hebrew only charset without nikud iso-8859-8 (or with nikud win-1255).

I agree, and this has been my feeling all along. By using iso-8859-8 internally (and, for the basic word lookup, an even more optimized 5-bit encoding) instead of utf-8, Hspell's memory usage is at least halved.

> it will be helpful if hspell will give the user convenience functions. this functions will take utf-8 and return utf-8. the functions will convert the utf-8 to the hebrew only coding that hspell will use internally.

So I guess that you're also in the UTF-8 camp. That's also the direction I'm leaning. But the question is - one day after Hspell gets a UTF-8 API, will people start complaining that it doesn't have a UTF-16, UTF-32, or some other sort of API? And don't answer "if they want UTF-16, let them use iconv to convert UTF-16 to UTF-8 and back" - after all, they can do this now with ISO-8859-8 (and, like you said, Enchant is doing exactly that) and still people complain ;-)

> p.s. i will be happy if hspell will give easy to use functions for using the library lingual info. in current version of hspell using lingual info is very hard. see: http://code.google.com/p/hspell-gir/source/browse/src/hspell-gir.vala

I agree that the linginfo (aka morphological analyzer) C API needs an overhaul. Out of embarrassment, it's not even documented in hspell(3) :-) It could also have been implemented more efficiently (memory-wise) than it is. But following the maxim "if it ain't broke, don't fix it", we haven't touched this code in years :(

P.S. Looking at http://code.google.com/p/hspell-gir/, I see that hspell-gui has a bug: it claims that החתול might mean ה+חתול with the second word being in construct form (סמיכות). But this isn't a valid split - the construct form cannot be preceded by the definite article (ה) - and Hspell knows this (try running hspell -al, or the demo at http://www.cs.technion.ac.il/~danken/cgi-bin/hspell.cgi, to check). Similarly, הירוק has only one legal meaning ("the green"), and the two other meanings listed in the png on your site are *wrong*. So it appears something is wrong with your word-splitting code? This is surprising if you're using libhspell... I didn't look at your code to see where it went wrong.

Nadav.

-- Nadav Har'El | Tuesday, Mar 13 2012
Re: Unicode in C
> So I guess that you're also in the UTF-8 camp.

yes, but my opinion about utf-8 is just my opinion. i like python and python defaults to utf-8. gtk likes unicode and utf-8: http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html qt likes more options: http://qt-project.org/doc/qt-4.8/qstring.html

> Looking at http://code.google.com/p/hspell-gir/, I see that hspell-gui has a bug

i probably misused the enum_split function, but i do not have time to check it :-(
Re: Unicode in C
I don't think that input/output matters so much. In something like hspell, I/O should be modular so that more encodings can be added later on. After all, it already has functions to translate to/from the internal representation. I believe that iso-8859-8 and utf-8 should be good enough for a start.

Ely

2012/3/13 kobi zamir kobi.za...@gmail.com:
> [...]
Re: Unicode in C
2012/3/13 kobi zamir kobi.za...@gmail.com:
> yes, but my opinion about utf-8 is just my opinion. i like python and python defaults to utf-8.

Python's internal representation is not UTF-8 but UTF-16 or UTF-32, depending on build parameters; in the narrow (UTF-16) builds, Python doesn't really support code points above the BMP. Of course, you cannot know the internal representation, since Python (cleverly) does not allow you to cast a unicode string to a sequence of bytes without specifying the result encoding. http://docs.python.org/c-api/unicode.html (see also this very good presentation on internal Unicode representations in various languages: http://98.245.80.27/tcpc/OSCON2011/gbu.html).
Re: Unicode in C
Hi,

2012/3/13 Elazar Leibovich elaz...@gmail.com:
> Python's internal representation is not UTF-8, but UTF-16, or UTF-32, depends on build parameters. Thus python doesn't really support code points above the BMP. Of course, you cannot know the internal representation ...

Nitpick: it's actually UCS-2/UCS-4 (which preceded the above but are compatible). And one actually can know the internal representation, by checking sys.maxunicode [1]. I'm using it in python-bidi to manually handle surrogate pairs if needed [2].

[1] http://docs.python.org/dev/library/sys.html#sys.maxunicode
[2] https://github.com/MeirKriheli/python-bidi/blob/master/src/bidi/algorithm.py#L46

Cheers
-- Meir
Re: Unicode in C
On Tue, Mar 13, 2012 at 1:19 PM, Meir Kriheli mkrih...@gmail.com wrote:
> Nitpick: It's actually ucs2/ucs4 (which preceded the above but are compatible).

Double nitpick: UCS-2 and UTF-16 are identical representations (for the BMP), and it's better to always use the name UTF-16, as the FAQ says (http://www.unicode.org/faq/basic_q.html#14):

"UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. *This term should now be avoided.*"

So I think it's perfectly reasonable to call the internal representation UTF-16. (And since Python offers some support for surrogate pairs, at least in string literals, it might even make sense to call it UTF-16.) (Sorry, I couldn't help it ;-)
Re: Unicode in C
On Mon, Mar 12, 2012 at 03:05:56PM +0200, Nadav Har'El wrote:
> Hi, I have a question that I was sort of sad that I couldn't readily find the answer to... Let's say I want to create a C API (a C library), with functions which take strings as arguments. What am I supposed to use if I want these strings to be in any language? Obviously the answer is Unicode, but that doesn't really answer the question... How is Unicode used in C?
>
> As far as I can see, there are two major approaches to this problem. One approach, used in the Win32 C APIs on MS-Windows, and also in Java and other languages, is to use "wide characters" - characters of 16 or 32 bit size - with strings being arrays of such characters. The second approach, proposed by Plan 9, is to use UTF-8.
>
> I personally like the UTF-8 approach better, because it fits naturally with C's char * type and with Linux's system calls (which take char*, not any sort of wide characters), but I'm completely unsure that this is what users actually want. If not, then I wonder, why?
>
> Some background on this question: People have been complaining for years that Hspell, and in particular the libhspell functions, use ISO-8859-8 instead of Unicode. But if one wants to add Unicode to libhspell, what should it be? UTF-8? Wide chars (UTF-16 or UTF-32)?

I think this background is most important. The real question is the motivation of the people complaining. If it is anything beyond "yuck, 8-bit is old!", we should ask them which encoding is good for their use case. When I compiled hspell for a (paying!) customer who used Windows, I wrote my own wrapper functions to convert Windows' wide chars to hspell's 8-bit (and vice versa). I bet that anyone using libhspell in a Unix-like environment would prefer UTF-8. In my opinion, it is nice to fit the modern standards of your major target environment (read: utf-8), but not necessary to cater to all encodings.

Would you even consider supplying a hspell_iso88598_to_utf8 function to help your client app do the conversion itself? I'm not sure this is our beeswax. However, this is only me and my bets. If anyone needs another encoding, let him speak now or use his own iconv calls forever.

Dan.
Re: Unicode in C
On Tue, Mar 13, 2012, Dan Kenigsberg wrote about "Re: Unicode in C":
> In my opinion, it is nice to fit to modern standards of your major target environment (read: utf8), but not necessary to cater to all encodings.

It appears that the consensus on this list is that UTF-8 is indeed the right way to do Unicode in C on Linux. I'm happy with this consensus, but I just can't help wondering why I can hardly find evidence for this supposed preference anywhere :( E.g., in Glib's gunicode.h I find UTF-32 characters called gunichar. Fribidi also appears to take (e.g., see fribidi_log2vis(3)) UTF-32 strings. Qt appears to use UTF-16 internally. What major free software C library actually prefers UTF-8?

-- Nadav Har'El | Tuesday, Mar 13 2012
Re: Unicode in C
On Tue, Mar 13, 2012 at 5:22 PM, Nadav Har'El n...@math.technion.ac.il wrote:
> Qt appears to use internally UTF-16. What major free software C library actually prefer UTF-8?

Are you talking about the internal representation, or the external interface? The internal representation is in many cases UTF-16. Indeed, except for Go, and it seems Perl, I can't think of any other language, open source or not, that has UTF-8 as its internal representation. That said, the internal representation should not be exposed to anyone, so it shouldn't really matter to anyone that you're using ISO-8859-1 internally, as long as they don't have to convert their text to that arcane format. However, if you look around, a lot of text files, documentation, HTML files, and open network wire formats (e.g., JSON) use UTF-8 as their text encoding. So in this sense, I think it's a de facto standard.
Re: Unicode in C
Something very important one needs to consider is Unicode normalization. That is, how to strip out the niqqud, and to substitute, say, KAF WITH DAGESH (U+FB3B) with just a KAF (U+05DB), etc. I guess you're already doing that to some degree in hspell, so (in case you're translating to ISO-8859-8) you just have to be careful not to miss any letters in the conversion from Unicode.

On Mon, Mar 12, 2012 at 3:05 PM, Nadav Har'El n...@math.technion.ac.il wrote:
> [...]
Re: Unicode in C
On Tue, Mar 13, 2012, Elazar Leibovich wrote about "Re: Unicode in C":
> Something very important, one need to consider is Unicode normalization. That is, how to strip out the Niqud, and to substitute, say KAF WITH DAGESH (U+FB3B) with just a KAF (U+05DB) etc.

Is this really important? Does anybody actually use "Kaf with Dagesh"? Why does it even exist? :( I noticed there are even more bizarre characters, like HEBREW LETTER ALEF WITH MAPIQ (!?), HEBREW LIGATURE ALEF LAMED, HEBREW LETTER WIDE ALEF, HEBREW LETTER ALEF WITH QAMATS (is Yiddish called "Hebrew" now??), HEBREW LETTER ALTERNATIVE AYIN, and other junk. Why do these exist? This is sad.

Nadav.

-- Nadav Har'El | Tuesday, Mar 13 2012
Re: Unicode in C
On Tue, Mar 13, 2012 at 10:16 PM, Nadav Har'El n...@math.technion.ac.il wrote:
> Is this really important? Does anybody actually use "Kaf with Dagesh"? Why does it even exist? :(

I'm not sure, nor am I sure why LOVE HOTEL or JAPANESE GOBLIN exist. When I read this stuff I'm not sure whether to laugh or cry. Most are probably never used, although I'd need to ask people in the publishing industry; maybe they use special symbols there. Maybe some of the wise folks on the list will enlighten us. However, as they say, the Unicode consortium provided the cure before the blow (הקדים רפואה למכה), and made standard normalization algorithms which are supposed to solve this problem and convert all text to a standard form (I'm not sure it really covers all the edge cases, though). I'm not sure normalization should be included in hspell, but I would put a notice that the input is expected to be normalized in order to work. And I would at least support niqqud.
Re: Unicode in C
Nadav Har'El wrote on Tue, Mar 13, 2012 at 22:16:23 +0200:
> Is this really important? Does anybody actually use "Kaf with Dagesh"? Why does it even exist? :(

FWIW, Unicode normalization isn't just about ignoring niqqud; it's also about having two or more equivalent forms for the same object - such as precomposed é (U+00E9) versus decomposed e plus combining acute accent (U+0065, U+0301). I'm not sure whether this particular issue applies to Hebrew.

Daniel (maybe you knew this already)
Re: Unicode in C
imho: hspell does hebrew spelling well. we have iconv, glib, qt ... for doing encoding conversions well. http://en.wikipedia.org/wiki/Unix_philosophy#McIlroy:_A_Quarter_Century_of_Unix on the other side, it will be very nice to have a utf-8 interface to hspell :-)
Re: Unicode in C
It depends upon your tradeoffs. If you use mostly Western scripts (Latin, Hebrew, etc.) and want to economize on memory use, use UTF-8. For Chinese, however, it costs more memory than it saves. If you need to handle Far Eastern scripts and/or need random access into your text, use a fixed-size wide character encoding (16 or 32 bits).

My suggestion for the particular case of libhspell is as follows.
1. Is there any standard API for spellchecking libraries? If yes, try to use it.
2. Otherwise, specify two such APIs - one UTF-8 based, one fixed-size wide character based. Create two binary variants of libhspell and optimize each one for the corresponding API. Hopefully it'll be possible to use essentially the same code base for 16-bit and 32-bit characters.

The rationale is that different word processors may need either API, and that they need to run spellchecking as fast as possible.

--- Omer

On Mon, 2012-03-12 at 15:05 +0200, Nadav Har'El wrote:
> [...]

-- My own blog is at http://www.zak.co.il/tddpirate/
Re: Unicode in C
On Mon, Mar 12, 2012 at 3:20 PM, Omer Zak w...@zak.co.il wrote:
> If you need to use Far Eastern fonts and/or have random access for your text, use fixed size wide character encoding (16 bit or 32 bit size).

Note that UTF-16 doesn't really offer random access, due to surrogate pairs (not all Unicode code points fit into 0..2^16), although some implementations simply ignore this fact. I humbly suggest you have a look at https://github.com/elazarl/javaUnicodePitfalls, where I tried to capture some common language pitfalls (despite the name, not everything is unique to Java).
Re: Unicode in C
The simplest option is to accept a StringPiece-like structure (pointer to buffer + size) plus an encoding, then to convert the data internally to your encoding (say, ISO-8859-8, replacing illegal characters with whitespace), and to convert the output back. Do you mind using an iconv-like library?

On Mon, Mar 12, 2012 at 3:05 PM, Nadav Har'El n...@math.technion.ac.il wrote:
> [...]
Re: Unicode in C
My suggestion is to go the glib/gtk route and use utf-8 everywhere, with the API accepting char* - i.e., there is no typedef for a Unicode character string. If this is not acceptable because of speed (its only tradeoff), then use UCS-4 internally and provide two external interfaces, for UCS-4 and UTF-8. For backwards compatibility you can provide your own iso-8859-8-to-utf-8 conversion functions. I suggest that you don't add an iconv dependency but let the user take care of character set conversions, which you don't really care about.

Regards, Dov

2012/3/12 Elazar Leibovich elaz...@gmail.com:
> The simplest option is, to accept StringPiece-like structure (pointer to buffer + size), and encoding, then to convert the data internally to your encoding (say, ISO-8859-8, replacing illegal characters with whitespace), and convert the other output back. Do you mind using iconv-like library?
> [...]
Re: Unicode in C
What's the advantage of using UCS-4 internally? Especially if the program needs to save memory (embedded devices are pretty common these days).

Ely

2012/3/12 Dov Grobgeld dov.grobg...@gmail.com:
> My suggestion is go the glib/gtk approach and use utf-8 everywhere and have the API accept char*, i.e. there is no typedef for a unicode character strings. If this is not acceptable because of speed (this is its only tradeoff), then use UCS-4 internally and provide two external interfaces for UCS-4 and UTF-8.
> [...]
Re: Unicode in C
On Mon, Mar 12, 2012 at 5:39 PM, E L elyl...@cs.huji.ac.il wrote:
> What's the advantage of using ucs-4 internally? Especially if the program needs to save memory (embedded devices are pretty common these days).

UTF-32 (or UCS-4) is the only encoding form that allows random access to each Unicode code point: every code point is exactly 32 bits. As I mentioned, UTF-16 was created with the intention of having indexable code points, but eventually there were too many of them (e.g. http://www.fileformat.info/info/unicode/char/1f3e9/index.htm, https://plus.google.com/109925364564856140495/posts, etc).
Re: Unicode in C
On Mon, Mar 12, 2012, Elazar Leibovich wrote about "Re: Unicode in C":
> The simplest option is, to accept StringPiece-like structure (pointer to buffer + size), and encoding, then to convert the data internally to your encoding (say, ISO-8859-8, replacing illegal characters with whitespace), and convert the other output back.

This is an option, but certainly not the simplest :-) I thought a simpler option would be to support only one encoding... But you're right that with an existing library to do the conversions, it might not be a big problem.

> Do you mind using iconv-like library?

What iconv-like library? I'm not ruling this idea out. But what worries me is that, in the end, my users will use only 1% of such a library's features - e.g., I'll never need its support for converting one encoding of Chinese to another. So people who want to use the 50 KB libhspell will suddenly need the 15 MB libicu.

-- Nadav Har'El | Monday, Mar 12 2012
Re: Unicode in C
On Mon, Mar 12, 2012 at 7:37 PM, Nadav Har'El n...@math.technion.ac.il wrote: This is an option, but certainly not the simplest :-)

It was the simplest idea *I could think of* at this moment ;p

What iconv-like library?

iconv-like means: do you mind using iconv from glibc? And if that's a problem due to Windows, embedded systems, etc. that do not feature glibc, do you mind having a dependency on another library, such as ICU, or at least something more lightweight that would handle all the UTF-* conversions?

I'm not ruling this idea out. But what worries me is that in the end, my users only use 1% of this library's features - e.g., I'll never need this library's support for converting one encoding of Chinese to another. So people who want to use the 50 KB libhspell will suddenly need the 15 MB libicu.

At least when using iconv on Linux this is not the case. First, the library is available in every distro; second, iconv is smart enough to split its functionality among many .so files and to dynamically load only the required shared objects at runtime. I'm not sure what the state of iconv on Windows is, though. Maybe you can fall back there to native system calls.

That said, on second thought, the single-byte encodings all seem to me more and more deprecated. Thus, I think it might be sufficient to support only UTF-16 and UTF-8. UTF-8 is common on the network and in files, and UTF-16 is common as an internal format in C++ libraries, Java, and C#, so it's important to support it for easier interoperability with those.
Re: Unicode in C
Enchant uses hspell as-is (ISO-8859-8) and just converts the strings when calling the hspell lib: http://www.abisource.com/viewvc/enchant/trunk/src/hspell/hspell_provider.c?view=markup

IMHO, because hspell only handles Hebrew, it can internally continue to use a Hebrew-only charset: ISO-8859-8 without nikud (or Windows-1255 with nikud).

It would be helpful if hspell gave the user convenience functions that take UTF-8 and return UTF-8. These functions would convert the UTF-8 to the Hebrew-only encoding that hspell uses internally.

p.s. I will be happy if hspell gives easy-to-use functions for using the library's lingual info. In the current version of hspell, using the lingual info is very hard. See: http://code.google.com/p/hspell-gir/source/browse/src/hspell-gir.vala

___ Linux-il mailing list Linux-il@cs.huji.ac.il http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il