Re: [Haskell-cafe] getting crazy with character encoding

2007-09-13 Thread John Meacham
On Wed, Sep 12, 2007 at 05:19:22PM +0200, Andrea Rossato wrote:
 And so it's my job to convert it into what I need. Luckily I've just
 discovered (and now I'm reading) some of John Meacham's code on
 locale. This is going to be very helpful (unfortunately I don't see
 Licenses coming with HsLocale, but if I'm reading correctly there is
 something like this in Riot - and this was BSD3 released).

It is BSD3. In general, pretty much everything I write is BSD3 except
for large projects as a whole, which get GPL>=2. Though I am more than
happy to BSD3 any incidentally useful parts of my projects that others
would find useful.


John Meacham - ⑆⑆john⑈
Haskell-Cafe mailing list

[Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Andrea Rossato

Suppose that, on a Linux system, in a UTF-8 locale, you create a file
with non-ASCII characters. For instance:
touch abèèè

Now, I would expect that the output of a shell command such as 
ls ab*
would be a string/list of 5 chars. Instead I find it to be a list of 8
chars.

That is to say, each non-ASCII character is read as 2 characters, as
if the string were ISO-8859-1 (and it is in fact treated as an
ISO-8859-1 string). But when I print it, it is displayed correctly.

I don't understand what's wrong and, this is worse, I don't understand
what I should be studying to understand what I'm doing wrong.

After reading about character encoding and the way the Linux kernel
manages file names, I would expect that a file name set in a UTF-8
locale would be read by a locale-aware application as a UTF-8 string,
with each character a Unicode code point that can be represented by a
Haskell Char. What's wrong with that?

Thanks for your kind attention.


Here is the code to test my problem. Before creating the file, remember to
set the LANG environment variable. Something like:
export LANG=en_US.utf8
should be fine. (Check your available locales with locale -a.)

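import System.Process
import System.IO
import Control.Monad

main :: IO ()
main = do
  -- Ask bash to expand the glob and list the matching file names.
  l <- fmap lines $ runProcessWithInput "/bin/bash" [] "ls ab*\n"
  putStrLn (show l)
  mapM_ putStrLn l
  mapM_ (putStrLn . show . length) l

-- Run a command, feed it the given input on stdin, and return its stdout.
runProcessWithInput :: FilePath -> [String] -> String -> IO String
runProcessWithInput cmd args input = do
  (pin, pout, perr, ph) <- runInteractiveProcess cmd args Nothing Nothing
  hPutStr pin input
  hClose pin
  output <- hGetContents pout
  -- Comparing output with itself forces the lazy string before pout is closed.
  when (output==output) $ return ()
  hClose pout
  hClose perr
  _ <- waitForProcess ph
  return output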


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Brandon S. Allbery KF8NH

On Sep 12, 2007, at 10:18 , Andrea Rossato wrote:

Suppose that, on a Linux system, in a UTF-8 locale, you create a file
with non-ASCII characters. For instance:
touch abèèè

Now, I would expect that the output of a shell command such as
ls ab*
would be a string/list of 5 chars. Instead I find it to be a list of 8
chars.

That is expected.  The low level filesystem storage doesn't know  
about character sets, so non-ASCII filenames must be encoded in e.g.  
UTF-8.  8 characters is therefore correct, and you must do UTF-8  
decoding on input because Haskell does not do so automatically.

This will also be true with getdirent() aka getDirectoryContents.
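
A minimal sketch of that point, using only System.IO from base. It assumes a GHC release newer than the ones discussed in this thread, where handles carry an encoding that can be set with hSetEncoding. The helper names writeRawBytes and readFirstLineWith are made up for illustration. The idea is to read the same bytes back once as ISO-8859-1 and once as UTF-8:

```haskell
import System.IO

-- | Write a string as raw bytes: in binary mode each Char (0-255 here)
-- becomes exactly one byte, with no encoding applied.
writeRawBytes :: FilePath -> String -> IO ()
writeRawBytes path bytes =
  withBinaryFile path WriteMode $ \h -> hPutStr h bytes

-- | Read the first line of a file, decoding its bytes with the given
-- encoding (for example 'latin1' or 'utf8' from System.IO).
readFirstLineWith :: TextEncoding -> FilePath -> IO String
readFirstLineWith enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    hGetLine h
```

If the eight bytes of the file name are written with writeRawBytes (as "ab\195\168\195\168\195\168\n"), reading them back with latin1 yields a string of length 8, while reading them back with utf8 yields the 5-character string.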

brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [EMAIL PROTECTED]
system administrator [openafs,heimdal,too many hats] [EMAIL PROTECTED]
electrical and computer engineering, carnegie mellon university    KF8NH


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Seth Gordon

Andrea Rossato wrote:


Suppose that, on a Linux system, in a UTF-8 locale, you create a file
with non-ASCII characters. For instance:
touch abèèè

Now, I would expect that the output of a shell command such as 
ls ab*

would be a string/list of 5 chars. Instead I find it to be a list of 8
chars.

The file name may have five *characters*, but if it's encoded as UTF-8, 
then it has eight *bytes*.
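
The character/byte distinction can be checked with a small hand-written encoder. This is only a sketch using base; encodeChar and encodeUtf8 are hypothetical names, not library functions, and the code does not reject surrogate code points:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one Unicode code point as 1 to 4 UTF-8 bytes.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading payload bits, taken from the top of the code point.
    top :: Int -> Word8
    top s = fromIntegral (n `shiftR` s)
    -- A continuation byte: 10xxxxxx carrying 6 bits starting at bit s.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- | Encode a whole string as UTF-8 bytes.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar
```

For the file name above, written with escapes as "ab\232\232\232", the string has length 5 while encodeUtf8 produces 8 bytes: 97, 98, and then the pair 195, 168 three times.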

It appears that in spite of the locale definition, hGetContents is 
treating each byte as a separate character without translating the 
multi-byte sequences *from* UTF-8, and then putStrLn sends each of those 
bytes to standard output without translating the non-ASCII characters 
*to* UTF-8.  So the second line of your program's output is 
correct...but only by accident.

Futzing around a little bit in ghci, I see that I can define a string 
"\1488", but if I send that string to putStrLn, I get nothing, when I 
should get א (the Hebrew letter aleph).

I � Unicode.


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Andrea Rossato
On Wed, Sep 12, 2007 at 10:53:29AM -0400, Brandon S. Allbery KF8NH wrote:
  That is expected.  The low level filesystem storage doesn't know about 
  character sets, so non-ASCII filenames must be encoded in e.g. UTF-8.  8 
  characters is therefore correct, and you must do UTF-8 decoding on input 
  because Haskell does not do so automatically.

Ahh, now I finally get it! So, as far as I understand, I'm getting
bytes that are automatically translated into an ISO-8859-1 string, if
I'm reading this old post by Glynn correctly:

And so it's my job to convert it into what I need. Luckily I've just
discovered (and now I'm reading) some of John Meacham's code on
locale. This is going to be very helpful (unfortunately I don't see
Licenses coming with HsLocale, but if I'm reading correctly there is
something like this in Riot - and this was BSD3 released).

Thanks for your kind attention.



Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Dougal Stanton
On 12/09/2007, Seth Gordon [EMAIL PROTECTED] wrote:

 I � Unicode.

Was it intentional that the central character appears as a little '?',
even though the aleph on the line above worked? Either way it would be
very amusing, but for different reasons...


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Andrea Rossato
On Wed, Sep 12, 2007 at 11:16:25AM -0400, Seth Gordon wrote:
  It appears that in spite of the locale definition, hGetContents is treating 
  each byte as a separate character without translating the multi-byte 
  sequences *from* UTF-8, and then putStrLn sends each of those bytes to 
  standard output without translating the non-ASCII characters *to* UTF-8.  So 
  the second line of your program's output is correct...but only by accident.

That's it indeed. As I said in the message I've just sent, I've read
that the String/CString conversion is automatically done in
ISO-8859-1, so èèè, which are 6 bytes in UTF-8, are translated
into 6 ISO-8859-1 characters.
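
To make the byte-per-character reading concrete, here is a small sketch. The name latin1Decode is hypothetical, not a library function; it relies only on the fact that ISO-8859-1 maps each byte value to the code point with the same number:

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- | Interpret each byte as one ISO-8859-1 character.
-- Byte value n becomes the Char with code point n.
latin1Decode :: [Word8] -> String
latin1Decode = map (chr . fromIntegral)
```

Applied to the UTF-8 bytes of the file name, [97,98,195,168,195,168,195,168], this gives "ab\195\168\195\168\195\168": 8 characters, with each è replaced by the two characters \195 and \168.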

What puzzles me is the behavior of putStrLn.

Thanks for your time.



Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Seth Gordon

Andrea Rossato wrote:

What puzzles me is the behavior of putStrLn.

putStrLn is sending the following bytes to standard output:

97, 98, 195, 168, 195, 168, 195, 168, 10

Since the code that renders characters in your terminal emulator is 
expecting UTF-8[*], each (195, 168) pair of bytes is rendered as è.

The Unix utility od can be very helpful in figuring out problems like this.

[*]At least on my computer, I get the same result *even if* I change 
LANG from en_US.utf8 to C.
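
A sketch of why the output looks right, assuming (as this thread describes for the GHC of the time) that output keeps only the low 8 bits of each Char. The name bytesWritten is a hypothetical helper, not part of any library:

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Model of byte-oriented output: keep only the low 8 bits of each Char.
-- Converting an Int to Word8 with fromIntegral wraps modulo 256.
bytesWritten :: String -> [Word8]
bytesWritten = map (fromIntegral . ord)
```

Under that assumption, the 8-character string "ab\195\168\195\168\195\168" followed by a newline maps back to exactly the bytes listed above, so the terminal sees valid UTF-8. The same model gives the single byte 208 for "\1488", which is not a complete UTF-8 sequence and would explain why the aleph did not show up.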


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Seth Gordon

Dougal Stanton wrote:

On 12/09/2007, Seth Gordon [EMAIL PROTECTED] wrote:

I � Unicode.

Was it intentional that the central character appears as a little '?',
even though the aleph on the line above worked?

It was intentional.  If I ♡ed Unicode, I would have said so.


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Andrea Rossato
On Wed, Sep 12, 2007 at 11:40:11AM -0400, Seth Gordon wrote:
  The Unix utility od can be very helpful in figuring out problems like this.

Thanks for pointing me to od, I didn't know it.

  [*]At least on my computer, I get the same result *even if* I change LANG 
  from en_US.utf8 to C.

As far as I understand, it is the terminal emulator that is responsible
for translating the bytes to characters. If I run it in a console I get
abAAA (sort of) no matter what my LANG is: 8 single 8-bit characters.



Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread David Benbennick
On 9/12/07, Andrea Rossato [EMAIL PROTECTED] wrote:
 If I run it in a console I get
 abAAA (sort of) no matter what my LANG is: 8 single 8-bit characters.

It's possible to set your Linux console to grok UTF8.  I don't
remember the details, but I'm sure you can Google for it.

By the way, does anyone know The Right Way to deal with UTF-8 in
Haskell?  I.e., take that 8 byte UTF-8 string and convert it to a 5
character Unicode string (so it can be manipulated)?

Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Jules Bean

David Benbennick wrote:

On 9/12/07, Andrea Rossato [EMAIL PROTECTED] wrote:

If I run it in a console I get
abAAA (sort of) no matter what my LANG is: 8 single 8-bit characters.

It's possible to set your Linux console to grok UTF8.  I don't
remember the details, but I'm sure you can Google for it.

By the way, does anyone know The Right Way to deal with UTF-8 in
Haskell?  I.e., take that 8 byte UTF-8 string and convert it to a 5
character Unicode string (so it can be manipulated)?

There is no UTF8 decode support in the standard libraries.

There are some contributed libraries which can do it. Data.CompactString 
is one.
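
For illustration only, the conversion asked about can also be written by hand with nothing but base. This is a sketch, not the API of Data.CompactString or any other library; decodeUtf8 and its helpers are hypothetical names, and malformed input is replaced with U+FFFD as a design choice of this example:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode a list of UTF-8 bytes into a String.
-- Malformed sequences are replaced with U+FFFD instead of raising an error.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80              = chr (fromIntegral b) : decodeUtf8 bs
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F)
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F)
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07)
  | otherwise             = '\xFFFD' : decodeUtf8 bs
  where
    -- Consume n continuation bytes and combine them with the lead bits.
    multi :: Int -> Int -> String
    multi n lead =
      case splitAt n bs of
        (cs, rest)
          | length cs == n && all isContinuation cs ->
              let cp = foldl addBits lead cs
              in (if isValid n cp then chr cp else '\xFFFD') : decodeUtf8 rest
        _ -> '\xFFFD' : decodeUtf8 bs

-- | Continuation bytes have the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation c = c .&. 0xC0 == 0x80

-- | Shift the accumulator left and append the 6 payload bits of a byte.
addBits :: Int -> Word8 -> Int
addBits acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)

-- | Reject overlong encodings, surrogates, and values above U+10FFFF.
isValid :: Int -> Int -> Bool
isValid n cp =
  cp >= minimumFor n && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)
  where
    minimumFor :: Int -> Int
    minimumFor 1 = 0x80
    minimumFor 2 = 0x800
    minimumFor _ = 0x10000
```

Given the 8 bytes 97, 98, 195, 168, 195, 168, 195, 168, decodeUtf8 returns the 5-character string "ab\232\232\232", which can then be manipulated like any other String.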


Re: [Haskell-cafe] getting crazy with character encoding

2007-09-12 Thread Don Stewart
 On Wed, Sep 12, 2007 at 11:16:25AM -0400, Seth Gordon wrote:
   It appears that in spite of the locale definition, hGetContents is treating 
   each byte as a separate character without translating the multi-byte 
   sequences *from* UTF-8, and then putStrLn sends each of those bytes to 
   standard output without translating the non-ASCII characters *to* UTF-8.  So 
   the second line of your program's output is correct...but only by accident.
 that's it indeed. As I said in the message I've just sent, I've read
 that the String/CString conversion is automatically done in
 ISO-8859-1, so èèè, which are 6 bytes in utf-8, are translated
 into 6 iso-8859-1 characters.
 What puzzles me is the behavior of putStrLn.
 Thanks for your time.

Have you tried the utf8-string conversion library?

-- Don