Re: Unicode in C
On Mon, Mar 12, 2012, Omer Zak wrote about "Re: Unicode in C":
> It depends upon your tradeoffs. ... 2. Otherwise, specify two such APIs - one is UTF-8 based, one is fixed size wide character based. Create two binary variants of the libhspell ...

This is why I asked this question in the first place - I'm aware of the tradeoffs, and of the possibility to create two variants of every function (or three, if you include the existing ISO-8859-8 API). I was just wondering - could it be that 20 years (!) after UTF-8 was invented for use in Plan 9 to counter wide characters, neither method has won?

As I see it, UTF-8 vs. wide characters (or UTF-16, or UTF-32, which are all similar for my needs) is a big-endian/little-endian kind of issue: it's possible to list all sorts of advantages for each one, but in the end each choice is good and has a large number of followers, and a choice simply has to be made. Continuing to use all of these approaches is not a good thing, as I see it. Even if in practice I can write all these APIs with not too much effort, I think it's ugly.

-- Nadav Har'El | Tuesday, Mar 13 2012, n...@math.technion.ac.il | Phone +972-523-790466, ICQ 13349191 | http://nadav.harel.org.il | "Live as if you were to die tomorrow, learn as if you were to live forever."

___ Linux-il mailing list Linux-il@cs.huji.ac.il http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
Re: Unicode in C
On Tue, Mar 13, 2012, kobi zamir wrote about "Re: Unicode in C":
> imho because hspell only use hebrew, it can internally continue to use hebrew only charset without nikud iso-8859-8 (or with nikud win-1255).

I agree, and this has been my feeling all along. By using iso-8859-8 internally (and, for the basic word lookup, an even more optimized 5-bit encoding) instead of utf-8, Hspell's memory usage is at least halved.

> it will be helpful if hspell will give the user convenience functions. this functions will take utf-8 and return utf-8. the functions will convert the utf-8 to the hebrew only coding that hspell will use internally.

So I guess that you're also in the UTF-8 camp. That's also the direction I'm leaning. But the question is - one day after Hspell gets a UTF-8 API, will people start complaining that it doesn't have a UTF-16, UTF-32, or some other sort of API? And don't answer "if they want UTF-16, let them use iconv to convert UTF-16 to UTF-8 and back" - after all, they can do this now with ISO-8859-8 (and, like you said, Enchant is doing exactly that) and still people complain ;-)

> p.s. i will be happy if hspell will give easy to use functions for using the library lingual info. in current version of hspell using lingual info is very hard. see: http://code.google.com/p/hspell-gir/source/browse/src/hspell-gir.vala

I agree that the linginfo (aka morphological analyzer) C API needs an overhaul. Out of embarrassment, it's not even documented in hspell(3) :-) It could also have been implemented more efficiently (memory-wise) than it is. But following the maxim "if it ain't broke, don't fix it", we haven't touched this code in years :(

P.S. Looking at http://code.google.com/p/hspell-gir/, I see that hspell-gui has a bug: it claims that החתול might mean ה+חתול with the second word being in construct form (סמיכות). But this isn't a valid split - the construct form cannot be preceded by the definite article (ה) - and Hspell knows this (try running hspell -al, or the demo at http://www.cs.technion.ac.il/~danken/cgi-bin/hspell.cgi, to check). Similarly, הירוק has only one legal meaning ("the green"), and the two other meanings listed in the png on your site are *wrong*. So it appears something is wrong with your word-splitting code? This is surprising if you're using libhspell... I didn't look at your code to see where it went wrong.

Nadav.

-- Nadav Har'El | Tuesday, Mar 13 2012
Re: Unicode in C
> So I guess that you're also in the UTF-8 camp.

yes, but my opinion about utf-8 is just my opinion. i like python and python defaults to utf-8. gtk likes unicode and utf-8: http://www.gtk.org/api/2.6/glib/glib-Unicode-Manipulation.html qt likes more options: http://qt-project.org/doc/qt-4.8/qstring.html

> Looking at http://code.google.com/p/hspell-gir/, I see that hspell-gui has a bug

i probably misused the enum_split function, but i do not have time to check it :-(
Re: Unicode in C
I don't think that input/output matters so much. In something like hspell, I/O should be modular so that more encodings can be added later on. After all, it already has functions to translate to/from the internal representation. I believe that iso-8859-8 and utf-8 should be good enough for a start.

Ely

2012/3/13 kobi zamir kobi.za...@gmail.com:
> [...]
Re: Unicode in C
2012/3/13 kobi zamir kobi.za...@gmail.com:
> yes, but my opinion about utf-8 is just my opinion. i like python and python defaults to utf-8.

Python's internal representation is not UTF-8 but UTF-16 or UTF-32, depending on build parameters; in the narrow (UTF-16) builds, Python doesn't really support code points above the BMP. Of course, you cannot know the internal representation, since Python (cleverly) does not allow you to cast a unicode string to a sequence of bytes without specifying the result encoding. http://docs.python.org/c-api/unicode.html (see also this very good presentation on internal Unicode representations in various languages: http://98.245.80.27/tcpc/OSCON2011/gbu.html).
Re: Unicode in C
Hi,

2012/3/13 Elazar Leibovich elaz...@gmail.com:
> Python's internal representation is not UTF-8, but UTF-16, or UTF-32, depends on build parameters. Thus python doesn't really support code points above the BMP. Of course, you cannot know the internal representation ...

Nitpick: it's actually UCS-2/UCS-4 (which preceded the above but are compatible). And one actually can know the internal representation, by checking sys.maxunicode [1]. I'm using it in python-bidi to manually handle surrogate pairs if needed [2].

[1] http://docs.python.org/dev/library/sys.html#sys.maxunicode
[2] https://github.com/MeirKriheli/python-bidi/blob/master/src/bidi/algorithm.py#L46

Cheers
-- Meir
Re: Unicode in C
On Tue, Mar 13, 2012 at 1:19 PM, Meir Kriheli mkrih...@gmail.com wrote:
> Nitpick: It's actually ucs2/ucs4 (which preceded the above but are compatible).

Double nitpick: UCS-2 and UTF-16 are identical representations (for the BMP), and it's better to always use the name UTF-16, as the FAQ says (http://www.unicode.org/faq/basic_q.html#14):

"UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. *This term should now be avoided.*"

So I think it's perfectly reasonable to call the internal representation UTF-16. (And since Python offers some support for surrogate pairs, at least in string literals, it might even make sense to call it UTF-16.) (Sorry, I couldn't help it ;-)
Re: Unicode in C
On Mon, Mar 12, 2012 at 03:05:56PM +0200, Nadav Har'El wrote:
> Hi, I have a question that I was sort of sad that I couldn't readily find the answer to... Let's say I want to create a C API (a C library), with functions which take strings as arguments. What am I supposed to use if I want these strings to be in any language? Obviously the answer is Unicode, but that doesn't really answer the question... How is Unicode used in C?
>
> As far as I can see, there are two major approaches to this problem. One approach, used in the Win32 C APIs on MS-Windows, and also in Java and other languages, is to use "wide characters" - characters of 16 or 32 bit size - with strings being arrays of such characters. The second approach, proposed by Plan 9, is to use UTF-8.
>
> I personally like the UTF-8 approach better, because it fits naturally with C's char * type and with Linux's system calls (which take char*, not any sort of wide characters), but I'm completely unsure that this is what users actually want. If not, then I wonder, why?
>
> Some background on this question: People have been complaining for years that Hspell, and in particular the libhspell functions, use ISO-8859-8 instead of Unicode. But if one wants to add Unicode to libhspell, what should it be? UTF-8? Wide chars (UTF-16 or UTF-32)?

I think this background is most important. The real question is the motivation of the people complaining. If it is anything beyond "yuck, 8-bit is old!", we should ask them which encoding is good for their use case. When I compiled hspell for a (paying!) customer who used Windows, I wrote my own wrapper functions to convert Windows' wide chars to hspell's 8-bit (and vice versa). I bet that anyone using libhspell in a Unix-like environment would prefer UTF-8. In my opinion, it is nice to fit the modern standards of your major target environment (read: utf-8), but not necessary to cater to all encodings.

Would you even consider supplying a hspell_iso88598_to_utf8 function to help your client app do the conversion itself? I'm not sure this is our beeswax. However, this is only me and my bets. If anyone needs another encoding, let him speak now or use his own iconv calls forever.

Dan.
Re: Unicode in C
On Tue, Mar 13, 2012, Dan Kenigsberg wrote about "Re: Unicode in C":
> In my opinion, it is nice to fit to modern standards of your major target environment (read: utf8), but not necessary to cater to all encodings.

It appears that the consensus on this list is that UTF-8 is indeed the right way to do Unicode in C on Linux. I'm happy with this consensus, but I just can't help wondering why I can hardly find evidence for this supposed preference anywhere :( E.g., in Glib's gunicode.h I find UTF-32 characters called gunichar. Fribidi also appears to take (e.g., see fribidi_log2vis(3)) UTF-32 strings. Qt appears to use UTF-16 internally. What major free software C library actually prefers UTF-8?

-- Nadav Har'El | Tuesday, Mar 13 2012
Re: Unicode in C
On Tue, Mar 13, 2012 at 5:22 PM, Nadav Har'El n...@math.technion.ac.il wrote:
> Qt appears to use internally UTF-16. What major free software C library actually prefer UTF-8?

Are you talking about the internal representation, or the external interface? The internal representation is in many cases UTF-16. Indeed, except for Go, and it seems Perl, I can't think of any other language, open source or not, that has UTF-8 as its internal representation. That said, the internal representation should not be exposed to anyone, so it shouldn't really matter to anyone that you're using ISO-8859-1 internally, as long as they don't have to convert their text to that arcane format. However, if you look around, a lot of text files, documentation, HTML files, and open network wire formats (e.g., JSON) use UTF-8 as their text encoding. So in this sense, I think it's a de facto standard.
Re: Unicode in C
Something very important one needs to consider is Unicode normalization. That is, how to strip out the niqqud, and to substitute, say, KAF WITH DAGESH (U+FB3B) with just a KAF (U+05DB), etc. I guess you're already doing that to some degree in hspell, so (in case you're translating to ISO-8859-8) you just have to be careful not to miss any letters in the conversion from Unicode.

On Mon, Mar 12, 2012 at 3:05 PM, Nadav Har'El n...@math.technion.ac.il wrote:
> [...]
Re: Unicode in C
On Tue, Mar 13, 2012, Elazar Leibovich wrote about "Re: Unicode in C":
> Something very important, one need to consider is Unicode normalization. That is, how to strip out the Niqud, and to substitute, say KAF WITH DAGESH (U+FB3B) with just a KAF (U+05DB) etc.

Is this really important? Does anybody actually use "Kaf with Dagesh"? Why does it even exist? :( I noticed there are even more bizarre characters, like HEBREW LETTER ALEF WITH MAPIQ (!?), HEBREW LIGATURE ALEF LAMED, HEBREW LETTER WIDE ALEF, HEBREW LETTER ALEF WITH QAMATS (is Yiddish called "Hebrew" now??), HEBREW LETTER ALTERNATIVE AYIN, and other junk. Why do these exist? This is sad.

Nadav.

-- Nadav Har'El | Tuesday, Mar 13 2012
Re: Unicode in C
On Tue, Mar 13, 2012 at 10:16 PM, Nadav Har'El n...@math.technion.ac.il wrote:
> Is this really important? Does anybody actually use "Kaf with Dagesh"? Why does it even exist? :(

I'm not sure, nor am I sure why LOVE HOTEL or JAPANESE GOBLIN exist. When I read this stuff I'm not sure whether to laugh or cry. Most are probably never used, although I'd need to ask people in the publishing industry; maybe they use special symbols there. Maybe some of the wise folks on the list will enlighten us. However, as they say, the Unicode consortium provided the cure before the blow (הקדים רפואה למכה), and made standard normalization algorithms which are supposed to solve this problem and convert all text to a standard form (I'm not sure it really covers all the edge cases, though). I'm not sure normalization should be included in hspell, but I would put a notice that the input is expected to be normalized in order to work. And I would at least support niqqud.
Re: Unicode in C
Nadav Har'El wrote on Tue, Mar 13, 2012 at 22:16:23 +0200:
> Is this really important? Does anybody actually use "Kaf with Dagesh"? Why does it even exist? :(

FWIW, Unicode normalization isn't just about ignoring niqqud; it's also about having two or more equivalent forms for the same object - such as precomposed é (U+00E9) versus decomposed e plus combining acute accent (U+0065, U+0301). I'm not sure whether this particular issue applies to Hebrew.

Daniel (maybe you knew this already)
Re: Unicode in C
imho: hspell does hebrew spelling well. we have iconv, glib, qt ... for doing encoding conversions well. http://en.wikipedia.org/wiki/Unix_philosophy#McIlroy:_A_Quarter_Century_of_Unix on the other side, it will be very nice to have a utf-8 interface to hspell :-)
Re: Unicode in C
It depends upon your tradeoffs. If you use mostly Western scripts (Latin, Hebrew, etc.) and want to economize on memory use, use UTF-8. For Chinese, however, it costs more memory than it saves. If you need to handle Far Eastern scripts and/or need random access into your text, use a fixed-size wide character encoding (16 or 32 bits).

My suggestion for the particular case of libhspell is as follows.
1. Is there any standard API for spellchecking libraries? If yes, try to use it.
2. Otherwise, specify two such APIs - one UTF-8 based, one fixed-size wide character based. Create two binary variants of libhspell and optimize each one for the corresponding API. Hopefully it'll be possible to use essentially the same code base for 16-bit and 32-bit characters.

The rationale is that different word processors may need either API, and that they need to run spellchecking as fast as possible.

--- Omer

On Mon, 2012-03-12 at 15:05 +0200, Nadav Har'El wrote:
> [...]

-- My own blog is at http://www.zak.co.il/tddpirate/
Re: Unicode in C
On Mon, Mar 12, 2012 at 3:20 PM, Omer Zak w...@zak.co.il wrote:
> If you need to use Far Eastern fonts and/or have random access for your text, use fixed size wide character encoding (16 bit or 32 bit size).

Note that UTF-16 doesn't really offer random access, due to surrogate pairs (not all Unicode code points fit into 0..2^16), although some implementations simply ignore this fact. I humbly suggest you have a look at https://github.com/elazarl/javaUnicodePitfalls, where I tried to capture some common language pitfalls (despite the name, not everything is unique to Java).
Re: Unicode in C
The simplest option is to accept a StringPiece-like structure (pointer to buffer + size) plus an encoding, then to convert the data internally to your encoding (say, ISO-8859-8, replacing illegal characters with whitespace), and to convert the output back. Do you mind using an iconv-like library?

On Mon, Mar 12, 2012 at 3:05 PM, Nadav Har'El n...@math.technion.ac.il wrote:
> [...]
Re: Unicode in C
My suggestion is to go the glib/gtk route and use utf-8 everywhere, with the API accepting char* - i.e., there is no typedef for a Unicode character string. If this is not acceptable because of speed (its only tradeoff), then use UCS-4 internally and provide two external interfaces, for UCS-4 and UTF-8. For backwards compatibility you can provide your own iso-8859-8-to-utf-8 conversion functions. I suggest that you don't add an iconv dependency but let the user take care of character set conversions, which you don't really care about.

Regards, Dov

2012/3/12 Elazar Leibovich elaz...@gmail.com:
> The simplest option is, to accept StringPiece-like structure (pointer to buffer + size), and encoding, then to convert the data internally to your encoding (say, ISO-8859-8, replacing illegal characters with whitespace), and convert the other output back. Do you mind using iconv-like library?
> [...]
Re: Unicode in C
What's the advantage of using UCS-4 internally? Especially if the program needs to save memory (embedded devices are pretty common these days).

Ely

2012/3/12 Dov Grobgeld dov.grobg...@gmail.com:
> My suggestion is go the glib/gtk approach and use utf-8 everywhere and have the API accept char*, i.e. there is no typedef for a unicode character strings. If this is not acceptable because of speed (this is its only tradeoff), then use UCS-4 internally and provide two external interfaces for UCS-4 and UTF-8.
> [...]
Re: Unicode in C
On Mon, Mar 12, 2012 at 5:39 PM, E L elyl...@cs.huji.ac.il wrote:
> What's the advantage of using ucs-4 internally? Especially if the program needs to save memory (embedded devices are pretty common these days).

UTF-32 (or UCS-4) is the only encoding form that allows random access to each Unicode code point: every code point is exactly 32 bits. As I mentioned, UTF-16 was created with the intention of having indexable code points, but eventually there were too many of them (e.g. http://www.fileformat.info/info/unicode/char/1f3e9/index.htm, https://plus.google.com/109925364564856140495/posts, etc).
Re: Unicode in C
On Mon, Mar 12, 2012, Elazar Leibovich wrote about "Re: Unicode in C":
> The simplest option is, to accept StringPiece-like structure (pointer to buffer + size), and encoding, then to convert the data internally to your encoding (say, ISO-8859-8, replacing illegal characters with whitespace), and convert the other output back.

This is an option, but certainly not the simplest :-) I thought a simpler option would be to support only one encoding... But you're right that with an existing library to do the conversions, it might not be a big problem.

> Do you mind using iconv-like library?

What iconv-like library? I'm not ruling this idea out. But what worries me is that, in the end, my users will use only 1% of such a library's features - e.g., I'll never need its support for converting one encoding of Chinese to another. So people who want to use the 50 KB libhspell will suddenly need the 15 MB libicu.

-- Nadav Har'El | Monday, Mar 12 2012
Re: Unicode in C
On Mon, Mar 12, 2012 at 7:37 PM, Nadav Har'El n...@math.technion.ac.il wrote: This is an option, but certainly not the simplest :-)

It was the simplest idea *I could think of* at this moment ;p

What iconv-like library?

iconv-like means: do you mind using iconv from glibc? And if that's a problem due to Windows, embedded systems, etc. that do not feature glibc, do you mind having a dependency on another library, such as ICU, or at least something more lightweight that would handle all the UTF-* conversions?

I'm not ruling this idea out. But what worries me is that in the end, my users only use 1% of this library's features - e.g., I'll never need this library's support for converting one encoding of Chinese to another. So people who want to use the 50 KB libhspell will suddenly need the 15 MB libicu.

At least when using iconv on Linux this is not the case. First, the library is available in every distro; second, iconv is smart enough to split its functionality among many .so files and to dynamically load only the required shared objects at runtime. I'm not sure what the state of iconv on Windows is, though. Maybe you can fall back there to native system calls.

That said, on second thought, the single-byte encodings all seem to me more and more deprecated. Thus, I think it might be sufficient to support only UTF-16 and UTF-8. UTF-8 is common on the network and in files, and UTF-16 is common as an internal format in C++ libraries, Java, and C#, so it's important to support it for easier interoperability with those.
Re: Unicode in C
Enchant uses hspell as-is (ISO-8859-8) and just converts the strings when calling the hspell lib: http://www.abisource.com/viewvc/enchant/trunk/src/hspell/hspell_provider.c?view=markup

IMHO, because hspell only handles Hebrew, it can internally continue to use a Hebrew-only charset: ISO-8859-8 without nikud (or Windows-1255 with nikud).

It would be helpful if hspell gave the user convenience functions that take UTF-8 and return UTF-8. These functions would convert the UTF-8 to the Hebrew-only encoding that hspell uses internally.

p.s. I will be happy if hspell gives easy-to-use functions for using the library's lingual info. In the current version of hspell, using the lingual info is very hard. See: http://code.google.com/p/hspell-gir/source/browse/src/hspell-gir.vala

___ Linux-il mailing list Linux-il@cs.huji.ac.il http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il