On Tue, Oct 09, 2001 at 12:37:27PM +0200, Kent Karlsson wrote:
> > At 2001-10-09 02:58, Kent Karlsson wrote:
> >
> > >In summary:
> > > code position (=code point): a value between 0000 and 10FFFF.
> >
> > Would this be a reasonable basis for Haskell's 'Char' type?
>
> Yes. It's essentially UTF-32, but without the fixation to 32-bit
> (21 bits suffice). UTF-32 (a.k.a. UCS-4 in 10646, yet to be limited
> to 10FFFF instead of 31(!) bits) is the datatype used in some
> implementations of C for wchar_t. As I said in another e-mail,
> if one does not have high efficiency concerns, UTF-32 is a rather
> straightforward way of representing characters.

I think space efficiency concerns are probably moot anyway: a Char would
most likely be represented by a possibly evaluated thunk, which I can't
imagine being smaller than a pointer in general, so for Haskell the
simplification of UTF-32 is most likely worth it. If space efficiency is
a concern, then I imagine people would want to use mutable arrays of
bytes or words anyway (perhaps mmap'ed from a file), not Haskell lists
of Chars.

> > At some point
> > perhaps there should be a 'Unicode' standard library for Haskell. For
> > instance:
> >
> > encodeUTF8 :: String -> [Word8];
> > decodeUTF8 :: [Word8] -> Maybe String;
> > encodeUTF16 :: String -> [Word16];
> > decodeUTF16 :: [Word16] -> Maybe String;
> >
> > data GeneralCategory = Letter_Uppercase | Letter_Lowercase | ...
> > getGeneralCategory :: Char -> Maybe GeneralCategory;
>
> There is not really any "Maybe" just there. Yet unallocated code
> positions have general category Cn (so do non-characters):
> Cs Other, Surrogate
> Co Other, Private Use
> Cn Other, Not Assigned (yet)
>
> > ...sorting & searching...
> >
> > ...canonicalisation...
> >
> > etc. Lots of work for someone.
>
> Yes. And it is lots of work (which is why I'm not volunteering
> to make a quick fix: there is no quick fix).

I think a canonical way to get at iconv's functionality (see 'man 3
iconv') in one of the standard libraries would be great. Perhaps I will
have a go at it. Even if the underlying platform does not have iconv,
some basic conversions (UTF-8, UTF-16, Latin-1, [Char]) could easily be
provided with the same API and minimal implementation effort.

	John

-- 
---------------------------------------------------------------------------
John Meacham - California Institute of Technology, Alum. - [EMAIL PROTECTED]
---------------------------------------------------------------------------
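
To give a feel for the "minimal implementation effort" case, here is a rough sketch of the two UTF-8 signatures quoted above, written as plain list functions with no iconv underneath. Only the names encodeUTF8 and decodeUTF8 and their types come from the thread. The helper names (encodeChar, multi) and the choice to reject overlong forms, surrogates and values above 10FFFF when decoding are my own assumptions, and the code assumes a Char that holds a full code point (0 to 10FFFF).

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode every Char as 1 to 4 bytes. Assumption: Char is a full code
-- point (0..0x10FFFF). Surrogates are not rejected on the encoding
-- side; a stricter version might want to do that.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap (encodeChar . ord)
  where
    encodeChar :: Int -> [Word8]
    encodeChar c
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [0xC0 .|. hi 6, cont 0]
      | c < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        -- payload bits that go into the lead byte
        hi :: Int -> Word8
        hi n = fromIntegral (c `shiftR` n)
        -- continuation byte: 10xxxxxx carrying six payload bits
        cont :: Int -> Word8
        cont n = 0x80 .|. fromIntegral ((c `shiftR` n) .&. 0x3F)

-- Decode a byte list, giving Nothing on any malformed input:
-- stray or missing continuation bytes, overlong forms, surrogates,
-- and values above 0x10FFFF.
decodeUTF8 :: [Word8] -> Maybe String
decodeUTF8 [] = Just []
decodeUTF8 (b : bs)
  | b < 0x80              = (chr (fromIntegral b) :) <$> decodeUTF8 bs
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise             = Nothing
  where
    -- n continuation bytes expected, lead = payload of the first byte,
    -- minCp = smallest code point allowed for this sequence length
    multi :: Int -> Int -> Int -> Maybe String
    multi n lead minCp =
      let (conts, rest) = splitAt n bs
          isCont x = x .&. 0xC0 == 0x80
          step acc x = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)
          cp = foldl step lead conts
          valid = length conts == n
               && all isCont conts
               && cp >= minCp
               && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if valid then (chr cp :) <$> decodeUTF8 rest else Nothing
```

Loaded into an interpreter, encodeUTF8 "\x20AC" should give [226,130,172], and decodeUTF8 [0xC0,0x80] should give Nothing since that is an overlong encoding of NUL. Note that the Maybe result forces the whole input before any output is available; a lazier interface would need a different error convention, which is exactly the sort of API question a shared iconv-style library would have to settle.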