Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-30 Thread Johan Tibell

> Am I wrong to think that UTF8 should be THE
> standard? I believe it can encode anything
> encoded by other encodings.

All the UTF-* encodings can encode the same code points. They make
different trade-offs, though.
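
To make the trade-offs concrete, here is a minimal sketch in plain Haskell (base only) that encodes one code point three ways so the byte counts can be compared. The names toUtf8, toUtf16BE, toUtf32BE and sizes are made up for this illustration and do not come from any library. It assumes a valid scalar value: surrogate code points and byte-order marks are not handled.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Low 8 bits of an Int as a byte.
byte :: Int -> Word8
byte = fromIntegral . (.&. 0xFF)

-- UTF-8: 1 to 4 bytes per code point, ASCII stays one byte.
toUtf8 :: Char -> [Word8]
toUtf8 c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. shiftR n 6), cont n]
  | n < 0x10000 = [byte (0xE0 .|. shiftR n 12), cont (shiftR n 6), cont n]
  | otherwise   = [ byte (0xF0 .|. shiftR n 18), cont (shiftR n 12)
                  , cont (shiftR n 6), cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low 6 bits of x.
    cont x = byte (0x80 .|. (x .&. 0x3F))

-- UTF-16 (big-endian): 2 bytes, or 4 bytes via a surrogate pair.
toUtf16BE :: Char -> [Word8]
toUtf16BE c
  | n < 0x10000 = unit n
  | otherwise   = unit high ++ unit low
  where
    n      = ord c
    n'     = n - 0x10000
    high   = 0xD800 + shiftR n' 10
    low    = 0xDC00 + (n' .&. 0x3FF)
    unit u = [byte (shiftR u 8), byte u]

-- UTF-32 (big-endian): always 4 bytes.
toUtf32BE :: Char -> [Word8]
toUtf32BE c = [byte (shiftR n 24), byte (shiftR n 16), byte (shiftR n 8), byte n]
  where n = ord c

-- Encoded size in bytes as (UTF-8, UTF-16, UTF-32).
sizes :: Char -> (Int, Int, Int)
sizes c = (length (toUtf8 c), length (toUtf16BE c), length (toUtf32BE c))
```

With these definitions, sizes 'A' is (1,2,4), sizes '\x20AC' (the euro sign) is (3,2,4) and sizes '\x1D11E' is (4,4,4). Same code points, different costs depending on the text.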

> Can't we consider non-utf8 text as "legacy"?
> I don't like that word, but I do think it is
> the right way to go for text. If you know
> your text has a different encoding, just use
> 'iconv' to convert it, or a special Haskell
> library for conversion.

The important thing (I think) is to have an abstract concept that
encompasses all the necessary characters (i.e. Unicode) and then a few
well-specified encodings with different trade-offs. A Unicode Haskell
library should handle at least a few of them (and, more importantly,
keep track of the encoding).

-- Johan
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

[Haskell-cafe] Re: Strings and utf-8

2007-11-30 Thread Maurício


>> Language of messages is quite different
>> from language of a file you read. (...)

> Yes, it's a fundamental limitation of the
> unix locale system and multi-user
> systems. However it's no less wrong than
> just picking UTF8 all the time. (...)

Am I wrong to think that UTF8 should be THE
standard? I believe it can encode anything
encoded by other encodings.

Can't we consider non-utf8 text as "legacy"?
I don't like that word, but I do think it is
the right way to go for text. If you know
your text has a different encoding, just use
'iconv' to convert it, or a special Haskell
library for conversion.

That will make life difficult for a few, but
make life a lot easier for programmers and
users.

Maurício


Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Reinier Lamers


Thomas Hartman wrote:

> A translation of
>
> http://www.ahinea.com/en/tech/perl-unicode-struggle.html
>
> from Perl to Haskell would be a very useful piece of documentation, I
> think.

Perl encodes both Unicode and binary data as the same (dynamic) data 
type. Haskell - at least in theory - has two different types for them, 
namely [Char] for characters and [Word8] or ByteString for sequences of 
bytes. I think the Haskell approach is better, because the programmer in 
most cases knows whether he wants to treat his data as characters or as 
bytes. Perl does it the Perlish "We guess at what the coder means" way, 
which leads to a lot of frustration when Perl guesses wrong.
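
A small sketch of what that separation buys, using only base. Bytes, decodeLatin1, encodeLatin1, shout and sampleBytes are names invented for this example; Latin-1 is picked only because its decoder fits in one line, not because anyone here recommends it.

```haskell
import Data.Char (chr, ord, toUpper)
import Data.Word (Word8)

-- Raw bytes and text are different types, so they cannot be mixed up silently.
type Bytes = [Word8]

-- Decoding Latin-1 is total: every byte maps to the code point of the same value.
decodeLatin1 :: Bytes -> String
decodeLatin1 = map (chr . fromIntegral)

-- Encoding can fail: characters above U+00FF have no Latin-1 byte.
encodeLatin1 :: String -> Maybe Bytes
encodeLatin1 = mapM encodeChar
  where
    encodeChar c
      | ord c < 0x100 = Just (fromIntegral (ord c))
      | otherwise     = Nothing

-- A character-level operation; it only accepts text.
shout :: String -> String
shout = map toUpper

sampleBytes :: Bytes
sampleBytes = [0x63, 0x61, 0x66, 0xE9]

-- shout sampleBytes                 -- rejected by the type checker: Word8 is not Char
-- shout (decodeLatin1 sampleBytes)  -- accepted: the decoding step is explicit
```

The programmer has to say where bytes become characters, so there is nothing for the language to guess.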


The problems of the Haskeller trying to use Unicode, I think, will be 
different from those of the Perl hacker trying to use Unicode: the 
Haskeller will have to search for third-party modules to do what he 
wants, and finding those modules is the problem. The Perl hacker has all 
the Unicode support built in, but has to fight Perl occasionally to keep 
it from doing byte operations on his Unicode data.


I had a colleague here go all but insane last week trying to use 'split' 
on a Unicode string in Perl on Windows. split would break the string in 
the middle of a multi-byte UTF-8 character, crashing UTF-8 processing later on.
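
The failure mode is easy to reproduce on a plain list of bytes. This is only an illustration in Haskell, not a reconstruction of what Perl's split does internally; isContinuation and splitAtBoundary are hypothetical helpers, and the input is assumed to be valid UTF-8.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- UTF-8 continuation bytes have the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Split at or before byte offset i, backing up until the second half
-- starts on a character boundary.
splitAtBoundary :: Int -> [Word8] -> ([Word8], [Word8])
splitAtBoundary i bs = splitAt (backUp (min i (length bs))) bs
  where
    backUp k
      | k > 0 && k < length bs && isContinuation (bs !! k) = backUp (k - 1)
      | otherwise = k

-- "a" followed by U+00E9 (e with acute accent) encoded as UTF-8.
sample :: [Word8]
sample = [0x61, 0xC3, 0xA9]
```

Here splitAt 2 sample gives ([0x61,0xC3],[0xA9]): neither half is valid UTF-8 any more. splitAtBoundary 2 sample gives ([0x61],[0xC3,0xA9]), which keeps the two-byte character whole.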


Reinier

Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Thomas Hartman

A translation of

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

from Perl to Haskell would be a very useful piece of documentation, I
think.

That explanation really helped me get to grips with the encoding stuff, in
a Perl context.

thomas.


Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Duncan Coutts

On Thu, 2007-11-29 at 13:05 +, Jules Bean wrote:

> Language of messages is quite different from language of a file you read.
> 
> Suppose I am English, and I have a Russian friend, Vlad.
> 
> My default locale is, say, Latin-1, and his is something Cyrillic.
> 
> I might well open files including my own files, and his files. The 
> locale of the current user is simply no guide to the correct encoding to 
> read a file in, and not a particularly reliable guide to writing a file out.
> 
> Locale makes perfect sense for messages (you are communicating with the 
> user, his locale tells you what language he speaks). It makes much less 
> sense for file IO.

Yes, it's a fundamental limitation of the Unix locale system and
multi-user systems. However, it's no less wrong than just picking UTF8
all the time. Obviously one needs a text file API that allows one to
specify the encoding for the cases where you happen to know it, but for
the H98 file API, where there is no way of specifying an encoding, what's
better than using the Unix default method? (At least on Unix.)
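
For illustration, a sketch of such a two-tier API: an explicit-encoding variant for when you know the encoding, and a locale fallback for when you don't. It relies on hSetEncoding, utf8, latin1 and localeEncoding from System.IO in GHC's base library, which as far as I know were added to base after this thread was written. writeFileAs, readFileAs and readFileLocale are names made up for this example.

```haskell
import System.IO

-- Write text with an explicitly chosen encoding.
writeFileAs :: TextEncoding -> FilePath -> String -> IO ()
writeFileAs enc path txt =
  withFile path WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h txt

-- Read text with an explicitly chosen encoding.
readFileAs :: TextEncoding -> FilePath -> IO String
readFileAs enc path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h enc
    txt <- hGetContents h
    -- Force the whole lazy string before withFile closes the handle.
    length txt `seq` return txt

-- Fallback when nothing is known about the file: use the user's locale.
readFileLocale :: FilePath -> IO String
readFileLocale = readFileAs localeEncoding
```

Writing a file with writeFileAs utf8 and reading it back with readFileAs utf8 round-trips the text. Reading the same file with readFileAs latin1 returns one character per byte instead, which is the kind of mismatch that locale guessing cannot rule out.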

Duncan


Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Jules Bean

Duncan Coutts wrote:

> On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:
>
>> >> (...)  When it's phrased as "truncates to 8
>> >> bits" it sounds so simple, surely all we need
>> >> to do is not truncate to 8 bits right?
>> >>
>> >> The problem is, what encoding should it pick?
>> >> UTF8, 16, 32, EBCDIC? (...)
>> >>
>> >> One sensible suggestion many people have made
>> >> is that H98 file IO should use the locale
>> >> encoding and do Unicode/String <-> locale
>> >> conversion. (...)
>>
>> I'm really afraid of solutions where the behavior
>> of your program changes with an environment
>> variable that not everybody has configured
>> properly, or even knows exists.
>
> Be afraid of all your standard Unix utils in that case. They are all
> locale dependent, not just for encoding but also for sorting order and
> the language of messages.

Language of messages is quite different from language of a file you read.

Suppose I am English, and I have a Russian friend, Vlad.

My default locale is, say, Latin-1, and his is something Cyrillic.

I might well open files including my own files, and his files. The 
locale of the current user is simply no guide to the correct encoding to 
read a file in, and not a particularly reliable guide to writing a file out.

Locale makes perfect sense for messages (you are communicating with the 
user, his locale tells you what language he speaks). It makes much less 
sense for file IO.

Jules

Re: [Haskell-cafe] Re: Strings and utf-8

2007-11-29 Thread Duncan Coutts

On Wed, 2007-11-28 at 17:38 -0200, Maurício wrote:
> >> (...)  When it's phrased as "truncates to 8
> >> bits" it sounds so simple, surely all we need
> >> to do is not truncate to 8 bits right?
> >>
> >> The problem is, what encoding should it pick?
> >> UTF8, 16, 32, EBCDIC? (...)
> >>
> >> One sensible suggestion many people have made
> >> is that H98 file IO should use the locale
> >> encoding and do Unicode/String <-> locale
> >> conversion. (...)
> 
> I'm really afraid of solutions where the behavior
> of your program changes with an environment
> variable that not everybody has configured
> properly, or even knows exists.

Be afraid of all your standard Unix utils in that case. They are all
locale dependent, not just for encoding but also for sorting order and
the language of messages.

Using the locale is standard Unix behaviour (and these days the locale
usually specifies UTF8 encoding). On OSX the default should be UTF8. On
Windows it's a bit less clear, supposedly text files should use UTF16
but nobody actually does that as far as I can see.

Duncan


[Haskell-cafe] Re: Strings and utf-8

2007-11-28 Thread Maurício


>>(...)  When it's phrased as "truncates to 8
>> bits" it sounds so simple, surely all we need
>> to do is not truncate to 8 bits right?
>>
>> The problem is, what encoding should it pick?
>> UTF8, 16, 32, EBCDIC? (...)
>>
>> One sensible suggestion many people have made
>> is that H98 file IO should use the locale
>> encoding and do Unicode/String <-> locale
>> conversion. (...)

I'm really afraid of solutions where the behavior
of your program changes with an environment
variable that not everybody has configured
properly, or even knows exists.

> Wouldn't it be sensible not to use the H98 file
> I/O operations at all anymore with binary files?
> A Char represents a Unicode code point value and
> is not the right data type to use to represent a
> byte from a binary stream.

That seems nice, we would not have to create a
"wide char" type just for Unicode.

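A quick sketch of why a Char is not a byte, and what "truncates to 8 bits" costs. It only illustrates the arithmetic and makes no claim about how any H98 implementation does its I/O internally. truncate8, roundTrip and lostInTruncation are names invented for this example.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Keep only the low 8 bits of a code point (fromIntegral wraps modulo 256).
truncate8 :: Char -> Word8
truncate8 = fromIntegral . ord

-- What comes back if the truncated byte is read as a character again.
roundTrip :: Char -> Char
roundTrip = chr . fromIntegral . truncate8

-- The characters of a string that do not survive truncation.
lostInTruncation :: String -> String
lostInTruncation = filter (\c -> roundTrip c /= c)
```

For example, truncate8 '\x20AC' (the euro sign) is 0xAC, so the character silently comes back as '\xAC'. Anything up to U+00FF survives, everything above it is mangled.
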
This topic made me search the net for that nice
quote:

"Explanations exist: they have existed for all
times, for there is always an easy solution to
every problem — neat, plausible and wrong."

(See: en.wikiquote.org/wiki/H._L._Mencken
That guy has many quotes worth reading.)

Strings as char lists is a very good example of
that. It's simple and clean, but strings are not
char lists in any reasonable sense.

Best,
Maurício

