Re: UTF-8 encode/decode libraries.

2004-05-05 Thread Antti-Juhani Kaijanaho
On 20040426T104946-0700, David Brown wrote:
> Is anyone aware of any Haskell libraries for doing UTF-8 decoding and
> encoding?  If not, I'll write something simple.

I wrote a simple Unicode library for my MSc project a couple of
years ago.  It might not compile with recent GHC, but you can have a
look at 
http://savannah.nongnu.org/cgi-bin/viewcvs/ebba/ebba-h/ebba-unicode/

-- 
Antti-Juhani Kaijanaho, FM (MSc), http://www.mit.jyu.fi/antkaij/
ohjelmistotekniikan assistentti* assistant in software engineering
Jyväskylän yliopisto   * University of Jyväskylä
Tietotekniikan laitos  * Dept. of Mathematical Inf. Tech.


___
Glasgow-haskell-users mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users


Re: UTF-8 encode/decode libraries.

2004-04-26 Thread David Brown
On Mon, Apr 26, 2004 at 08:33:38PM +0200, Sven Panne wrote:
> Duncan Coutts wrote:
> >On Mon, 2004-04-26 at 18:49, David Brown wrote: [...]
> >toUTF :: String -> String
> 
> Hmmm, "String -> [Word8]" would be nicer...
> 
> >fromUTF :: String -> String
> 
> ... and here: "[Word8] -> String" or "[Word8] -> Maybe String".

Except that I would then have to come up with my own IO routines to read
and write UTF data.  With both sides as string, it is easy to just
filter input and output of files.

Dave
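One possible way to keep both interfaces, sketched under the assumption that the I/O layer hands over one Char per byte: a byte-level function can be lifted to String -> String with two small adapters, so it still works as a plain filter on file contents. The names stringToBytes, bytesToString and liftBytes are made up for this illustration and do not come from any library mentioned in the thread.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- View a list of bytes as a String, one Char per byte.
bytesToString :: [Word8] -> String
bytesToString = map (chr . fromIntegral)

-- View a String as bytes. Fails with Nothing if any Char is above 0xFF,
-- i.e. if the String is not really a sequence of raw bytes.
stringToBytes :: String -> Maybe [Word8]
stringToBytes = mapM toByte
  where
    toByte c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing

-- Lift a byte-level transformation to Strings (hypothetical helper).
liftBytes :: ([Word8] -> [Word8]) -> String -> Maybe String
liftBytes f s = fmap (bytesToString . f) (stringToBytes s)
```

With adapters like these, a codec typed over [Word8] could still be dropped between readFile and writeFile, at the cost of one extra pass over the data.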


Re: UTF-8 encode/decode libraries.

2004-04-26 Thread Sven Panne
Duncan Coutts wrote:
> On Mon, 2004-04-26 at 18:49, David Brown wrote: [...]
> toUTF :: String -> String

Hmmm, "String -> [Word8]" would be nicer...

> fromUTF :: String -> String

... and here: "[Word8] -> String" or "[Word8] -> Maybe String".

Furthermore, UTF-8 is not restricted to a maximum of 3 bytes per character;
here is an excerpt from "man utf8" on my SuSE Linux:

   * UTF-8 encoded UCS characters may be up to six bytes
     long, however the Unicode standard specifies no characters
     above 0x10ffff, so Unicode characters can only be up to
     four bytes long in UTF-8.

IIRC we discussed encoders/decoders quite some time ago on the libraries
mailing list, but nothing really happened, which is a pity. We should
strive for something more general than UTF-8 <-> UCS/Unicode; there are
quite a few more widely used encodings, e.g. GSM 03.38. Any takers?

Cheers,
   S.
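A minimal sketch of what the suggested signatures could look like, covering sequences of up to four bytes. This is not the gtk2hs code; the names encodeUtf8 and decodeUtf8 are assumptions chosen for the example. The decoder returns Nothing on truncated input, stray continuation bytes, overlong forms, surrogates and values above 0x10FFFF. The encoder does not reject surrogate code points, and the decoder must see the whole input before it can return a result.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode a String as UTF-8 bytes (one to four bytes per character).
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (map fromIntegral . encodeChar . ord)
  where
    encodeChar :: Int -> [Int]
    encodeChar c
      | c <= 0x7F   = [c]
      | c <= 0x7FF  = [0xC0 .|. (c `shiftR` 6), cont c]
      | c <= 0xFFFF = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont n = 0x80 .|. (n .&. 0x3F)

-- Decode UTF-8 bytes, or return Nothing if the input is malformed.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b:bs)
  | b <= 0x7F              = fmap (chr (fromIntegral b) :) (decodeUtf8 bs)
  | b >= 0xC2 && b <= 0xDF = multi 1 (b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (b .&. 0x07) 0x10000
  | otherwise              = Nothing  -- stray continuation or invalid lead byte
  where
    -- n continuation bytes, payload bits of the lead byte, smallest legal value.
    multi :: Int -> Word8 -> Int -> Maybe String
    multi n leadBits minValue
      | length conts /= n            = Nothing  -- truncated sequence
      | not (all isCont conts)       = Nothing  -- bad continuation byte
      | cp < minValue                = Nothing  -- overlong encoding
      | cp > 0x10FFFF                = Nothing  -- beyond the Unicode range
      | cp >= 0xD800 && cp <= 0xDFFF = Nothing  -- surrogate code point
      | otherwise                    = fmap (chr cp :) (decodeUtf8 rest)
      where
        (conts, rest) = splitAt n bs
        cp = foldl step (fromIntegral leadBits) conts
        step acc t = (acc `shiftL` 6) .|. fromIntegral (t .&. 0x3F)
    isCont t = t .&. 0xC0 == 0x80
```

For example, the euro sign U+20AC should encode to the three bytes E2 82 AC, and decoding those bytes should give the character back.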


Re: UTF-8 encode/decode libraries.

2004-04-26 Thread Duncan Coutts
On Mon, 2004-04-26 at 18:49, David Brown wrote:
> Is anyone aware of any Haskell libraries for doing UTF-8 decoding and
> encoding?  If not, I'll write something simple.

The gtk2hs library uses the following functions internally.
Credit to Axel Simon I believe unless he swiped them from somewhere too.

import Data.Bits (shift, (.&.), (.|.))
import Data.Char (chr, ord)

-- Convert Unicode characters to UTF-8.
-- (Only code points up to U+FFFF are handled; larger ones are mis-encoded.)
--
toUTF :: String -> String
toUTF [] = []
toUTF (x:xs)
  | ord x <= 0x007F = x : toUTF xs
  | ord x <= 0x07FF = chr (0xC0 .|. ((ord x `shift` (-6)) .&. 0x1F)) :
                      chr (0x80 .|. (ord x .&. 0x3F)) :
                      toUTF xs
  | otherwise       = chr (0xE0 .|. ((ord x `shift` (-12)) .&. 0x0F)) :
                      chr (0x80 .|. ((ord x `shift` (-6)) .&. 0x3F)) :
                      chr (0x80 .|. (ord x .&. 0x3F)) :
                      toUTF xs

-- Convert UTF-8 to Unicode.
-- (Sequences of at most three bytes are accepted; anything else is an error.)
--
fromUTF :: String -> String
fromUTF [] = []
fromUTF (cs@(x:xs))
  | ord x <= 0x7F = x : fromUTF xs
  | ord x <= 0xBF = err
  | ord x <= 0xDF = twoBytes cs
  | ord x <= 0xEF = threeBytes cs
  | otherwise     = err
  where
    twoBytes (x1:x2:xs) = chr (((ord x1 .&. 0x1F) `shift` 6) .|.
                               (ord x2 .&. 0x3F)) : fromUTF xs
    twoBytes _ = error "fromUTF: illegal two byte sequence"

    threeBytes (x1:x2:x3:xs) = chr (((ord x1 .&. 0x0F) `shift` 12) .|.
                                    ((ord x2 .&. 0x3F) `shift` 6) .|.
                                    (ord x3 .&. 0x3F)) : fromUTF xs
    threeBytes _ = error "fromUTF: illegal three byte sequence"

    err = error "fromUTF: illegal UTF-8 character"

Duncan



UTF-8 encode/decode libraries.

2004-04-26 Thread David Brown
I am writing some utilities to deal with UTF-8 encoded text files (not
source).  Currently, I'm just reading in the UTF-8 directly, and things
work reasonably well, since my parse tokens are ASCII, they are easy to
parse.

However, the character type seems perfectly happy with larger values for
each character.

Is anyone aware of any Haskell libraries for doing UTF-8 decoding and
encoding?  If not, I'll write something simple.

Thanks,
Dave Brown
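To illustrate the observation about the character type: a small sketch showing that Haskell's Char spans the whole Unicode range, so UTF-8 text read one byte per Char leaves each multi-byte character split across several small Chars. The names below are made up for the example, and the byte-per-Char input is an assumption about the I/O layer.

```haskell
import Data.Char (chr, ord)

-- Largest code point a Char can hold.
largestCodePoint :: Int
largestCodePoint = ord (maxBound :: Char)

-- The euro sign U+20AC as it looks when its three UTF-8 bytes
-- are read one byte per Char (assumed undecoded input).
euroUndecoded :: String
euroUndecoded = map chr [0xE2, 0x82, 0xAC]

-- The same character after decoding: a single Char.
euroDecoded :: String
euroDecoded = [chr 0x20AC]
```

ASCII tokens look the same in both forms, which is why parsing on ASCII delimiters still works on the undecoded text.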