Re: [Haskell-cafe] Re: getting crazy with character encoding
On Thu, Sep 13, 2007 at 11:07:03AM +0200, Stephane Bortzmeyer wrote: On Thu, Sep 13, 2007 at 12:23:33AM +, Aaron Denney [EMAIL PROTECTED] wrote a message of 76 lines which said: the characters read and written should correspond to the native environment notions and encodings. These are, under Unix, determined by the locale system. Locales, while fine for things like the language of the error messages or the format to use to display the time, are *not* a good solution for things like file names and file contents. Even on a single Unix machine (without networking), there are *several* users. Using the locale to find out the charset used for a file name won't work if these users use different locales. Yes indeed. And I find it a real mess. And I don't see any way out. Same thing for file contents. The charset used must be marked in the file (XML...) or in the metadata, somehow. Otherwise, there is no way to exchange files or even to change the locale (if I switch from Latin1 to UTF-8, what do my files become?) Ok, you are perfectly right, but we live in an imperfect world and we must come up with a solution. In my case I'm developing this prompt for xmonad and a Chinese user wants directory and file names to be correctly displayed. What else can I do but using locale technologies? This is something I don't know. The code below is not perfect but it works to some extent. Nonetheless, if you have 2 users using an iso-8859-1 locale the first and utf-8 one the second, non ascii characters in file names of the first users will produce invalid character sequences for the second users. The reverse will work, though. I'm still puzzled and still find the thread title appropriate. Thanks for your kind attention. Andrea The locale aware version of the previous code (needs hsc2hs) {-# OPTIONS -fglasgow-exts #-} import Prelude hiding (catch) import System.Process import System.IO import Control.Monad import System.Directory import Foreign import Foreign.C import Data.Char import Control.Exception runProcessWithInput cmd args input = do (pin, pout, perr, ph) - runInteractiveProcess cmd args Nothing Nothing hPutStr pin input hClose pin output - hGetContents pout when (output==output) $ return () hClose pout hClose perr waitForProcess ph return output main = do setupLocale l - fmap lines $ runProcessWithInput /bin/bash [] ls ab*\n l' - mapM fromLocale l l'' - mapM toLocale l' putStrLn (show l') mapM_ putStrLn l'' mapM_ (putStrLn . show . length) l' -- This code comes from John Meacham's HsLocale -- http://repetae.net/john/repos/HsLocale/ toLocale :: String - IO String toLocale s = catch (stringToBytes s = return . map (chr . fromIntegral)) (const $ return invalid character sequence) fromLocale :: String - IO String fromLocale s = bytesToString (map (fromIntegral . ord) s) `catch` \_ - return invalid character sequence stringToBytes :: String - IO [Word8] stringToBytes cs = (withIConv UTF-32 $ \ic - convertRaw ic cs) bytesToString :: [Word8] - IO String bytesToString xs = (withIConv UTF-32 $ \ic - convertRaw ic xs) = return . f where f ('\65279':xs) = xs -- discard byte order marker f xs = xs newtype IConv = IConv (#type intptr_t) deriving(Num,Eq,Show) foreign import ccall unsafe iconv.h iconv_open iconv_open :: Ptr CChar - Ptr CChar - IO IConv foreign import ccall unsafe iconv.h iconv_close iconv_close :: IConv - IO CInt foreign import ccall unsafe iconv.h iconv iconv :: IConv - Ptr (Ptr CChar) - Ptr CSize - Ptr (Ptr CChar) - Ptr CSize - IO CInt withIConv :: String - String - (IConv - IO a) - IO a withIConv to from action = bracket open close action where close ic = throwErrnoIfMinus1_ iconv_close (iconv_close ic) open = throwErrnoIfMinus1 iconv_open iopen iopen = do withCAString to $ \to - do withCAString from $ \from - do iconv_open to from convertRaw :: (Storable a, Storable b) = IConv - [a] - IO [b] convertRaw ic xs = do with (fromIntegral $ sizeOf (head xs) * length xs) $ \inptrSz - do withArray xs $ \arr - do with (castPtr arr) $ \inptr - do allocaBytes (1024) $ \outptr - do with outptr $ \outptrptr - do with 1024 $ \outptrSz - do let outSz = fromIntegral $ sizeOf $ unsafePerformIO (peek outptr) let go = do ret - iconv ic inptr inptrSz (castPtr outptrptr) outptrSz err - getErrno case (ret,err) of (-1,_) | err == e2BIG - do oz - peek outptrSz x - peekArray ((1024 - fromIntegral oz) `div` outSz) (castPtr outptr) poke outptrptr outptr poke outptrSz 1024 y - go return $ x ++ y (-1,_) - throwErrno iconv (_,_) - do oz - peek outptrSz peekArray ((1024
Re: [Haskell-cafe] Re: getting crazy with character encoding
On Wed, 2007-09-12 at 17:40 -0700, Stefan O'Rear wrote: On Thu, Sep 13, 2007 at 12:23:33AM +, Aaron Denney wrote: Unfortunately, at this point it is a well entrenched bug, and changing the behaviour will undoubtedly break programs. ... There should be another system for getting the exact bytes in and out (as Word8s, say, rather than Chars), I'm pretty sure Hugs does the right thing. ..which makes me wonder what the right thing actually is? Since IO on Unix (or at least on Linux) consists of bytes, I don't see how a Unicode-only interface is ever going to do the 'right thing' for all people. One possible solution might be to have IO functions deal with [Word8] instead of [Char]. If string and character constants were polymorphic, Char and String made aliases for byte-based types, and a new type introduced for Unicode characters, it might even be possible to fix without breaking absolutely all legacy code. But even this would probably only fix the Unix side of things. -k ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: getting crazy with character encoding
On Thu, Sep 13, 2007 at 12:06:15PM +0200, Ketil Malde wrote: On Wed, 2007-09-12 at 17:40 -0700, Stefan O'Rear wrote: On Thu, Sep 13, 2007 at 12:23:33AM +, Aaron Denney wrote: Unfortunately, at this point it is a well entrenched bug, and changing the behaviour will undoubtedly break programs. ... There should be another system for getting the exact bytes in and out (as Word8s, say, rather than Chars), I'm pretty sure Hugs does the right thing. ..which makes me wonder what the right thing actually is? Since IO on Unix (or at least on Linux) consists of bytes, I don't see how a Unicode-only interface is ever going to do the 'right thing' for all people. I never said it was Unicode-only. hGetBuf / hPutBuf - Raw Word8 access getChar etc - Uses locale info Stefan signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: getting crazy with character encoding
On Thu, Sep 13, 2007 at 06:49:59AM -0700, Stefan O'Rear wrote: On Thu, Sep 13, 2007 at 12:06:15PM +0200, Ketil Malde wrote: On Wed, 2007-09-12 at 17:40 -0700, Stefan O'Rear wrote: On Thu, Sep 13, 2007 at 12:23:33AM +, Aaron Denney wrote: Unfortunately, at this point it is a well entrenched bug, and changing the behaviour will undoubtedly break programs. ... There should be another system for getting the exact bytes in and out (as Word8s, say, rather than Chars), I'm pretty sure Hugs does the right thing. ..which makes me wonder what the right thing actually is? Since IO on Unix (or at least on Linux) consists of bytes, I don't see how a Unicode-only interface is ever going to do the 'right thing' for all people. I never said it was Unicode-only. hGetBuf / hPutBuf - Raw Word8 access getChar etc - Uses locale info The problem is that the type of openFile and getArgs is wrong, so there's no right way to get a Handle (other than stdin) to read from in the first place, unless we're willing to allow the current weird behavior of treating a [Char] as [Word8]. -- David Roundy Department of Physics Oregon State University signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: getting crazy with character encoding
On Thu, Sep 13, 2007 at 12:23:33AM +, Aaron Denney wrote: Unfortunately, at this point it is a well entrenched bug, and changing the behaviour will undoubtedly break programs. ... There should be another system for getting the exact bytes in and out (as Word8s, say, rather than Chars), and there are in fact external libraries using lower level interfaces, rather than the things like putStr, getLine, etc. that do this. An external library works, of course, but it should be part of the standard so implementors know that character based routines actually are character based, not byte based. ... I don't know what NHC and hugs do, though I assume they also provide no translations. I'm also not sure what JHC does, though I do see mentions of UTF-8, UTF-16 (for windows), and UTF-32 (for internal usage of C libraries), and I do know that John is fairly careful about locale issues. I'm pretty sure Hugs does the right thing. NHC is probably broken. In any case, we already have hGetBuf / hPutBuf in the standard base libaries for raw binary IO, so code that uses getChar for bytes really has no excuse. We can and should fix the bug. Stefan signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe