Re: UTF-8 encode/decode libraries.
On 20040426T104946-0700, David Brown wrote:
> Is anyone aware of any Haskell libraries for doing UTF-8 decoding and
> encoding? If not, I'll write something simple.

I wrote a simple Unicode library for my MSc project a couple of years ago. It might not compile with recent GHC, but you can have a look at http://savannah.nongnu.org/cgi-bin/viewcvs/ebba/ebba-h/ebba-unicode/

-- 
Antti-Juhani Kaijanaho, FM (MSc), http://www.mit.jyu.fi/antkaij/
ohjelmistotekniikan assistentti * assistant in software engineering
Jyväskylän yliopisto * University of Jyväskylä
Tietotekniikan laitos * Dept. of Mathematical Inf. Tech.

___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Re: UTF-8 encode/decode libraries.
On Mon, Apr 26, 2004 at 08:33:38PM +0200, Sven Panne wrote:
> Duncan Coutts wrote:
> >On Mon, 2004-04-26 at 18:49, David Brown wrote: [...]
> >toUTF :: String -> String
>
> Hmmm, "String -> [Word8]" would be nicer...
>
> >fromUTF :: String -> String
>
> ... and here: "[Word8] -> String" or "[Word8] -> Maybe String".

Except that I would then have to come up with my own IO routines to read and write UTF data. With both sides as string, it is easy to just filter input and output of files.

Dave
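A minimal sketch of the filtering approach Dave describes, assuming a String -> String conversion in the style of toUTF or fromUTF is available. The name filterFile is hypothetical and appears nowhere in the thread. Both handles are opened in binary mode because current GHC versions otherwise decode text handles according to the locale; in binary mode each Char read or written corresponds to exactly one byte.

```haskell
import System.IO

-- | Apply a String -> String filter to the contents of a file.
-- Hypothetical helper, not part of any library mentioned in the thread.
-- Both handles are binary, so every Char stands for one byte (0..255).
-- The input and output paths are assumed to be different files.
filterFile :: (String -> String) -> FilePath -> FilePath -> IO ()
filterFile f inPath outPath =
  withBinaryFile inPath ReadMode $ \hIn ->
    withBinaryFile outPath WriteMode $ \hOut -> do
      contents <- hGetContents hIn   -- lazy read, one Char per byte
      hPutStr hOut (f contents)      -- forces the input before hIn is closed
```

With a decoder of type String -> String, a call such as filterFile fromUTF "in.txt" "out.txt" would read raw bytes and hand them to the decoder. Writing decoded text back out through a binary handle only makes sense if every resulting Char still fits in one byte, so in practice the filter on the output side would be the encoder.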
Re: UTF-8 encode/decode libraries.
Duncan Coutts wrote:
>On Mon, 2004-04-26 at 18:49, David Brown wrote: [...]
>toUTF :: String -> String

Hmmm, "String -> [Word8]" would be nicer...

>fromUTF :: String -> String

... and here: "[Word8] -> String" or "[Word8] -> Maybe String". Furthermore, UTF-8 is not restricted to a maximum of 3 bytes per character; here is an excerpt from "man utf8" on my SuSE Linux:

   * UTF-8 encoded UCS characters may be up to six bytes long, however
     the Unicode standard specifies no characters above 0x10ffff, so
     Unicode characters can only be up to four bytes long in UTF-8.

IIRC we discussed encoders/decoders quite some time ago on the libraries mailing list, but nothing really happened, which is a pity. We should strive for something more general than UTF-8 <-> UCS/Unicode; there are quite a few more widely used encodings, e.g. GSM 03.38, etc. Any takers?

Cheers,
   S.
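Sven's two points (byte-typed signatures and sequences of up to four bytes) can be sketched as below. This is an illustration under stated assumptions, not code from gtk2hs or from any library named in the thread: the names encodeUtf8 and decodeUtf8 are made up here, the decoder rejects overlong forms and values above 0x10FFFF, and neither function treats surrogate code points specially.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Encode a String as UTF-8 bytes, using one to four bytes per character.
-- Hypothetical name; surrogate code points are not rejected in this sketch.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (map fromIntegral . enc . ord)
  where
    enc :: Int -> [Int]
    enc c
      | c <= 0x7F   = [c]
      | c <= 0x7FF  = [0xC0 .|. (c `shiftR` 6), cont c]
      | c <= 0xFFFF = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits of its argument.
    cont x = 0x80 .|. (x .&. 0x3F)

-- | Decode UTF-8 bytes, returning Nothing on any malformed sequence.
-- The whole input is checked before a result is returned, so this is not lazy.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b:bs)
  | b <= 0x7F = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b <= 0xBF = Nothing                               -- stray continuation byte
  | b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b <= 0xF7 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise = Nothing                               -- five- and six-byte forms
  where
    -- n continuation bytes expected; minCode rules out overlong encodings.
    multi :: Int -> Int -> Int -> Maybe String
    multi n lead minCode =
      let (cs, rest) = splitAt n bs
          code = foldl (\acc x -> (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F)) lead cs
      in if length cs == n && all isCont cs && code >= minCode && code <= 0x10FFFF
           then (chr code :) <$> decodeUtf8 rest
           else Nothing
    isCont x = x .&. 0xC0 == 0x80
```

Returning Maybe String follows the second signature Sven suggests; the price is that the caller gets no output until the entire input has been validated, which matters for large files.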
Re: UTF-8 encode/decode libraries.
On Mon, 2004-04-26 at 18:49, David Brown wrote:
> Is anyone aware of any Haskell libraries for doing UTF-8 decoding and
> encoding? If not, I'll write something simple.

The gtk2hs library uses the following functions internally. Credit to Axel Simon, I believe, unless he swiped them from somewhere too.

-- Imports needed to compile these functions on their own.
import Data.Bits (shift, (.&.), (.|.))
import Data.Char (chr, ord)

-- Convert Unicode characters to UTF-8.
--
toUTF :: String -> String
toUTF [] = []
toUTF (x:xs)
  | ord x<=0x007F = x:toUTF xs
  | ord x<=0x07FF = chr (0xC0 .|. ((ord x `shift` (-6)) .&. 0x1F)):
                    chr (0x80 .|. (ord x .&. 0x3F)):
                    toUTF xs
  | otherwise     = chr (0xE0 .|. ((ord x `shift` (-12)) .&. 0x0F)):
                    chr (0x80 .|. ((ord x `shift` (-6)) .&. 0x3F)):
                    chr (0x80 .|. (ord x .&. 0x3F)):
                    toUTF xs

-- Convert UTF-8 to Unicode.
--
fromUTF :: String -> String
fromUTF [] = []
fromUTF (all@(x:xs))
  | ord x<=0x7F = x:fromUTF xs
  | ord x<=0xBF = err
  | ord x<=0xDF = twoBytes all
  | ord x<=0xEF = threeBytes all
  | otherwise   = err
  where
    twoBytes (x1:x2:xs) = chr (((ord x1 .&. 0x1F) `shift` 6) .|.
                               (ord x2 .&. 0x3F)):fromUTF xs
    twoBytes _ = error "fromUTF: illegal two byte sequence"

    threeBytes (x1:x2:x3:xs) = chr (((ord x1 .&. 0x0F) `shift` 12) .|.
                                    ((ord x2 .&. 0x3F) `shift` 6) .|.
                                    (ord x3 .&. 0x3F)):fromUTF xs
    threeBytes _ = error "fromUTF: illegal three byte sequence"

    err = error "fromUTF: illegal UTF-8 character"

Duncan
UTF-8 encode/decode libraries.
I am writing some utilities to deal with UTF-8 encoded text files (not source). Currently, I'm just reading in the UTF-8 directly, and things work reasonably well: since my parse tokens are ASCII, they are easy to parse. However, the character type seems perfectly happy with larger values for each character.

Is anyone aware of any Haskell libraries for doing UTF-8 decoding and encoding? If not, I'll write something simple.

Thanks,
Dave Brown