Re: UTF-8 library

2002-08-10 Thread Manuel M T Chakravarty

"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> wrote,

> Thu, 08 Aug 2002 19:28:18 +1000 (EST), Manuel M T Chakravarty <[EMAIL PROTECTED]> wrote:
> 
> > ANSI C guarantees that char is 1 byte (more precisely that
> > "sizeof (char)" == 1).
> 
> It says that sizeof (char) == 1 but doesn't say that this means 8 bits.
> sizeof is measured in chars, whatever size a char is. But the limits for
> values of char imply that it has at least 8 bits.
> 
> Perhaps we can assume some widely true facts, even if ANSI C doesn't
> guarantee them, if that makes life easier. For example, that a C type
> corresponding to Int32 exists at all, and that different pointer
> types have the same representation - we already rely on that, don't we?

Yes.  And, as Ketil pointed out, Worse Is Better.

Manuel
___
Haskell mailing list
[EMAIL PROTECTED]
http://www.haskell.org/mailman/listinfo/haskell



Re: UTF-8 library

2002-08-10 Thread Lennart Augustsson

Joe English wrote:

>Java attempts platform independence by declaring that all
>the world *is*, in fact, a VAX [*].
>
>
>[*] More precisely, a 32-bit platform with IEEE 754 floating point.
>
And the original VAX did in fact not have IEEE floating point. :-)

-- Lennart





Re: UTF-8 library

2002-08-10 Thread Joe English


Ashley Yakeley  wrote:
>
> One of the things that really bothers me about C is the way its
> unspecifiedness about types can "infect" other languages. For instance,
> what exactly is a Haskell Int?
>
> Java, at least, stands firm, but then platform-independence was one of
> Java's explicit design priorities.

Platform-independence is *also* one of Standard C's explicit
design goals; it just approaches it in a different way.

Standard C attempts platform independence by specifying the
existence of a certain number of built-in numeric types,
and certain guarantees about each of them.  It requires
that programmers know what is and is not guaranteed, however,
and write code accordingly.  It's possible to write portable
code in C, but you must abandon the assumption that (for
instance) an 'int' is exactly 32 bits, since that's not true
on all platforms.  The slogan is "All the world is not a VAX."

Java attempts platform independence by declaring that all
the world *is*, in fact, a VAX [*].


[*] More precisely, a 32-bit platform with IEEE 754 floating point.

--Joe English

  [EMAIL PROTECTED]



Re: UTF-8 library

2002-08-10 Thread anatoli

[apologies if you see multiple copies; I forgot to Cc: the list
the first time around.]

--- Sven Moritz Hallberg <[EMAIL PROTECTED]> wrote:

> [...] I think that it's
> ugly, though, to do it somewhere outside, pretending the issue's not
> there. What I value about Haskell is its clean representation of reality.
> Attaching all kinds of state to handles just isn't as clear as "Look
> here, a file: It's a sequence of octets.", "Watch out though, each file
> can use an entirely different encoding.", "The Char versions of the IO
> functions will try to deal with encoding for you.", and "If you know you
> need some special treatment, we have these functions blahblahblah..."

As I view it, a Handle is always a stream of Char data. Why? Simply because Haskell
treats Handles as streams of Char data *today*. There's no good reason to change
that, unless you want to wreak havoc in existing programs.

To make things i18n-friendly, the simplest (IMHO) approach is to declare that
under each Handle (i.e. Char stream) there is a BinaryHandle (i.e. Word8 stream)
*plus* an associated encoding (and also maybe a CR/LF handler while we're at it).

I certainly don't want the same Handle type to be able to represent a sequence of 
octets and a sequence of Char at the same time.

> > I routinely read and write messages in three different languages that
> > use three different encodings. All of them are my "own" languages.
> 
> Where is the problem? The system is not going to be able to decide which
> one to use either way, so you must make the encoding explicit. Now we
> just have to come up with a convenient way to do it. Transforming
> between [Word8] and [Char] seems plausible to me.

I want to be able to specify encoding explicitly *and* be able to use existing
Char IO, because that's what my programs use *today* and I don't want to rework 
them. Rewriting all my IO because it's now Word8-based instead of Char-based is
NOT convenient.
 
> > A "Word8 stream" can be either Handle (Word8Handle?) or [Word8]. We can transform
> > [Word8] to [Char], but not Word8Handle to CharHandle. I argue that the latter
> > is needed as well.
> 
> The only reason for that would be efficiency. Simon said something about
> that. I admit that I have no clue about it.

What about backward compatibility? With my approach, in order to make a Haskell
program i18n-aware, you only need to change a few calls to openFile and make them
openFileWithEncoding. Otherwise they will just use the default encoding.
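
A minimal sketch of what such a call could look like, assuming a base library whose System.IO provides the TextEncoding type and hSetEncoding (as later GHC releases do). The name openFileWithEncoding comes from the paragraph above; its signature and the roundTrip helper are assumptions made for illustration only.

```haskell
import System.IO

-- Hypothetical helper: open an ordinary Char-based Handle, but with an
-- explicitly chosen encoding instead of the locale default.
openFileWithEncoding :: FilePath -> IOMode -> TextEncoding -> IO Handle
openFileWithEncoding path mode enc = do
  h <- openFile path mode
  hSetEncoding h enc
  return h

-- Illustrative only: write a String as UTF-8, then read it back through
-- the same Char-based API. Existing hPutStr/hGetContents calls are unchanged.
roundTrip :: FilePath -> String -> IO String
roundTrip path s = do
  out <- openFileWithEncoding path WriteMode utf8
  hPutStr out s
  hClose out
  inp <- openFileWithEncoding path ReadMode utf8
  r <- hGetContents inp
  -- Force the lazy contents before closing the handle.
  length r `seq` hClose inp
  return r
```

Only the call that opens the file changes; every other Char or String IO function keeps working on the resulting Handle.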

-- 
a.




Re: UTF-8 library

2002-08-10 Thread David Feuer

On Sat, 10 Aug 2002, Ashley Yakeley wrote:

> One of the things that really bothers me about C is the way its
> unspecifiedness about types can "infect" other languages. For instance,
> what exactly is a Haskell Int?

I think it's the idea that's infectious, because it is a good idea.  The C
standard didn't do this for fun, but to allow programmers to find out what
integer type would be most efficient for general use.  The ideal is
probably to provide Int, Int8, Int16, Int32, Int64, Int128, Word8, Word16,
Word32, Word64, Word128.  Or something.

> Java, at least, stands firm, but then platform-independence was one of
> Java's explicit design priorities.

And all Java programs run equally slowly.




Re: UTF-8 library

2002-08-10 Thread Sven Moritz Hallberg

On Sat, 2002-08-10 at 12:03, anatoli wrote:
> --- Sven Moritz Hallberg <[EMAIL PROTECTED]> wrote:
> > I argue _strongly_ against associating some sort of locale state with
> > handles.
> > 
> > 1) In agreement with Ashley's statements, file IO should use octets,
> > because that's what's in a file.
> 
> By the same token, we should handle CR/LF/CR-LF/LF-CR mess by hand.
> (Files don't have lines in them, they are just sequences of octets.)

That's a good point, I've forgotten about this mess. I think that it's
ugly, though, to do it somewhere outside, pretending the issue's not
there. What I value about Haskell is its clean representation of reality.
Attaching all kinds of state to handles just isn't as clear as "Look
here, a file: It's a sequence of octets.", "Watch out though, each file
can use an entirely different encoding.", "The Char versions of the IO
functions will try to deal with encoding for you.", and "If you know you
need some special treatment, we have these functions blahblahblah..."


> I prefer a somewhat higher-level view of files.

Of course, so do I, I just want the higher-level view to be implemented
in Haskell, not under the hood of some ominous "handle" type; which,
btw, will then no longer be simply a handle but some sort of great big
file IO "object". That's confusing for anyone who hasn't been exposed to
the C way of dealing with files. I'd teach some old people clean
concepts they might not be used to, rather than repeating the same old
yuck to every new little programmer who's just starting.


> > 2) If you need to decode those octets to characters, or vice-versa,
> > compose a (de)serialization function before it.
> 
> I *always* need that. (Except for binary IO). Might as well have this 
> functionality built into a handle.

Well, then *always* use the Char functions. I don't see the point.


> > 3) A "best shot" character reading (or writing, for that matter)
> > function will be convenient. This should probably use your current
> > locale, because when writing a character, you'll probably want to be
> > able to write your own language's characters correctly.
> 
> I routinely read and write messages in three different languages that
> use three different encodings. All of them are my "own" languages.

Where is the problem? The system is not going to be able to decide which
one to use either way, so you must make the encoding explicit. Now we
just have to come up with a convenient way to do it. Transforming
between [Word8] and [Char] seems plausible to me.


> > 4) For decoding, we'll need some parsing functionality, as someone
> > already mentioned. With that we can have functions like parseUTF8.
> > "Associating a locale with a stream", as you put it, is a matter of, if
> > f is the raw Word8 stream, g = parseUTF8 f, where g is the Char stream,
> > parsed as UTF-8-encoded characters from f.
> 
> A "Word8 stream" can be either Handle (Word8Handle?) or [Word8]. We can transform
> [Word8] to [Char], but not Word8Handle to CharHandle. I argue that the latter
> is needed as well.

The only reason for that would be efficiency. Simon said something about
that. I admit that I have no clue about it.


Sven Moritz






Re: UTF-8 library

2002-08-10 Thread anatoli


--- Ashley Yakeley <[EMAIL PROTECTED]> wrote:
> >By the same token, we should handle CR/LF/CR-LF/LF-CR mess by hand.
> >(Files don't have lines in them, they are just sequences of octets.)
> 
> Correct. Exactly what kind of newline do you want in your file?

The correct answer depends on the level of abstraction. It can be either
"some specific kind of newline" or "whatever kind the OS wants", but
mostly it's "I don't care" (i.e. "whatever kind the Handle wants").
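
The three answers can be written down as a small policy type. This is a sketch only, assuming a System.IO that offers hSetNewlineMode, NewlineMode and nativeNewlineMode (as later GHC releases do); NewlinePolicy and applyPolicy are hypothetical names.

```haskell
import System.IO

-- The three possible answers to "what kind of newline do you want?"
data NewlinePolicy
  = Specific Newline -- "some specific kind of newline" (LF or CRLF)
  | Native           -- "whatever kind the OS wants"
  | DontCare         -- "whatever kind the Handle wants": leave it untouched

-- Configure a Handle according to the chosen policy.
applyPolicy :: Handle -> NewlinePolicy -> IO ()
applyPolicy h (Specific nl) =
  hSetNewlineMode h (NewlineMode { inputNL = nl, outputNL = nl })
applyPolicy h Native = hSetNewlineMode h nativeNewlineMode
applyPolicy _ DontCare = return ()
```

With Specific CRLF applied to an output Handle, each '\n' written through the Char API reaches the file as the two octets CR LF.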

> >A "Word8 stream" can be either Handle (Word8Handle?) or [Word8]. We can 
> >transform
> >[Word8] to [Char], but not Word8Handle to CharHandle. I argue that the latter
> >is needed as well.
> 
> Well, it should be a utility library built on top of the real Word8-based 
> functions:
> 
>   data TextHandle = MkTextHandle Handle TextEncoding;
>   etc.

I have no problem with that, except for the naming. Current IO functions
are mostly text-based and centered around Handles, and there's no good reason
to break that. Thus, your TextHandle probably should be a Handle and your
Handle probably should be a BinaryHandle.

Plus, the utility library should probably live on the C side, but that's
an implementation detail :)

-- 
a.





Re: UTF-8 library

2002-08-10 Thread Ashley Yakeley

At 2002-08-10 03:03, anatoli wrote:

>--- Sven Moritz Hallberg <[EMAIL PROTECTED]> wrote:
>> I argue _strongly_ against associating some sort of locale state with
>> handles.
>> 
>> 1) In agreement with Ashley's statements, file IO should use octets,
>> because that's what's in a file.
>
>By the same token, we should handle CR/LF/CR-LF/LF-CR mess by hand.
>(Files don't have lines in them, they are just sequences of octets.)

Correct. Exactly what kind of newline do you want in your file?

>I prefer a somewhat higher-level view of files.

Well, that's what encoding functions are for. You can take higher-level 
views of your octets as text, images, XML-structures, experimental 
datasets, whatever.

What's so special about text that the functionality should be bound 
_right into the API_?

>> 2) If you need to decode those octets to characters, or vice-versa,
>> compose a (de)serialization function before it.
>
>I *always* need that. (Except for binary IO).

You *always* need that. (Except when you don't).

The term "binary" is quite misleading. It suggests a particular file 
type, but it's actually used to mean "something other than 
ASCII-compatible text". One might as well have a word that means 
"something other than a JPEG image".

...
>A "Word8 stream" can be either Handle (Word8Handle?) or [Word8]. We can 
>transform
>[Word8] to [Char], but not Word8Handle to CharHandle. I argue that the latter
>is needed as well.

Well, it should be a utility library built on top of the real Word8-based 
functions:

  data TextHandle = MkTextHandle Handle TextEncoding;
  etc.
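
One way the "etc." might be filled in, as a sketch under stated assumptions rather than a proposed API. TextEncoding is modelled here as a hypothetical pair of pure functions, latin1Codec is an illustrative codec, and hGetOctets/hPutOctets stand in for the "real Word8-based functions" by relying on GHC's binary-mode Handles, which carry one octet per Char.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO (Handle, IOMode (..), hClose, hGetContents, hPutStr, openBinaryFile)

-- Hypothetical codec: a pair of pure conversion functions.
data TextEncoding = MkTextEncoding
  { decode :: [Word8] -> String
  , encode :: String -> [Word8]
  }

-- The wrapper from the message: a raw octet Handle plus an encoding.
data TextHandle = MkTextHandle Handle TextEncoding

-- ISO-8859-1: each octet is the code point with the same value.
latin1Codec :: TextEncoding
latin1Codec = MkTextEncoding
  { decode = map (chr . fromIntegral)
  , encode = map toOctet
  }
  where
    toOctet c
      | ord c < 256 = fromIntegral (ord c)
      | otherwise   = 63 -- '?' for characters Latin-1 cannot represent

-- Stand-ins for Word8-based IO on a binary-mode Handle.
hGetOctets :: Handle -> IO [Word8]
hGetOctets h = map (fromIntegral . ord) <$> hGetContents h

hPutOctets :: Handle -> [Word8] -> IO ()
hPutOctets h = hPutStr h . map (chr . fromIntegral)

-- The text layer is plain composition on top of the octet layer.
openTextFile :: FilePath -> IOMode -> TextEncoding -> IO TextHandle
openTextFile path mode enc = do
  h <- openBinaryFile path mode
  return (MkTextHandle h enc)

hGetText :: TextHandle -> IO String
hGetText (MkTextHandle h enc) = decode enc <$> hGetOctets h

hPutText :: TextHandle -> String -> IO ()
hPutText (MkTextHandle h enc) = hPutOctets h . encode enc

hCloseText :: TextHandle -> IO ()
hCloseText (MkTextHandle h _) = hClose h
```

Nothing in the octet layer knows about text; swapping latin1Codec for another codec changes only the value passed to openTextFile.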


-- 
Ashley Yakeley, Seattle WA




Re: UTF-8 library

2002-08-10 Thread anatoli

--- Sven Moritz Hallberg <[EMAIL PROTECTED]> wrote:
> I argue _strongly_ against associating some sort of locale state with
> handles.
> 
> 1) In agreement with Ashley's statements, file IO should use octets,
> because that's what's in a file.

By the same token, we should handle CR/LF/CR-LF/LF-CR mess by hand.
(Files don't have lines in them, they are just sequences of octets.)

I prefer a somewhat higher-level view of files.

> 2) If you need to decode those octets to characters, or vice-versa,
> compose a (de)serialization function before it.

I *always* need that. (Except for binary IO). Might as well have this 
functionality built into a handle.

> 3) A "best shot" character reading (or writing, for that matter)
> function will be convenient. This should probably use your current
> locale, because when writing a character, you'll probably want to be
> able to write your own language's characters correctly.

I routinely read and write messages in three different languages that
use three different encodings. All of them are my "own" languages.

> 4) For decoding, we'll need some parsing functionality, as someone
> already mentioned. With that we can have functions like parseUTF8.
> "Associating a locale with a stream", as you put it, is a matter of, if
> f is the raw Word8 stream, g = parseUTF8 f, where g is the Char stream,
> parsed as UTF-8-encoded characters from f.

A "Word8 stream" can be either Handle (Word8Handle?) or [Word8]. We can transform
[Word8] to [Char], but not Word8Handle to CharHandle. I argue that the latter
is needed as well.

-- 
a.




Re: UTF-8 library

2002-08-10 Thread Marcin 'Qrczak' Kowalczyk

Sat, 10 Aug 2002 01:31:51 -0700, Ashley Yakeley <[EMAIL PROTECTED]> wrote:

>>that different pointer
>>types have the same representation - we already rely on that, don't we?
> 
> No, we have separate Ptrs and FunctionPtrs IIRC...

Yes, but I mean the possibility that Ptr Word8 looks different from
Ptr Word32.

> One of the things that really bothers me about C is the way its
> unspecifiedness about types can "infect" other languages. For
> instance, what exactly is a Haskell Int?

It's unrelated to C. Int must have at least 30 bits. It's reasonable
to make it 32 bits on IA-32 and 64 bits on IA-64. There is no need to specify
its range exactly, because we have Integer and Int8/16/32/64 if needed,
and Int is the type appropriate for measuring array lengths, for example
(we won't have gigabyte arrays on IA-32 but might have them on IA-64).
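
A small sketch of how a program can observe this, using finiteBitSize from Data.Bits (which postdates this thread). The names intCoversReportRange, intBits and fixedWidths are illustrative, not part of any library.

```haskell
import Data.Bits (finiteBitSize)
import Data.Int (Int32, Int64)

-- The Haskell report only promises that Int covers [-2^29, 2^29 - 1].
intCoversReportRange :: Bool
intCoversReportRange =
  toInteger (minBound :: Int) <= -(2 ^ 29)
    && toInteger (maxBound :: Int) >= 2 ^ 29 - 1

-- The actual width of Int is left to the implementation.
intBits :: Int
intBits = finiteBitSize (0 :: Int)

-- The fixed-width types have the same width on every platform.
fixedWidths :: (Int, Int)
fixedWidths = (finiteBitSize (0 :: Int32), finiteBitSize (0 :: Int64))
```

Code that depends on an exact width uses Int32 or Int64; code that only counts or indexes uses Int and accepts whatever intBits turns out to be.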

-- 
  __("<  Marcin Kowalczyk
  \__/ [EMAIL PROTECTED]
   ^^http://qrnik.knm.org.pl/~qrczak/




Re: UTF-8 library

2002-08-10 Thread Marcin 'Qrczak' Kowalczyk

09 Aug 2002 10:17:21 +0200, Sven Moritz Hallberg <[EMAIL PROTECTED]> wrote:

> I argue _strongly_ against associating some sort of locale state with
> handles.
> 
> 1) In agreement with Ashley's statements, file IO should use octets,
> because that's what's in a file.

So it would imply two types: raw Handles for binary data and wrapped
text Handles for strings. You don't want to force users to explicitly
perform the conversion over and over, I hope? The wrappers would look like
IO does now.

But I don't see much point in separating those Handles. And it would
be more efficient if conversion could be done internally on the C side
(locale-dependent encoding functions are available from C) coupled with
Handle buffers, rather than going through a pure Haskell interface in
the middle.

-- 
  __("<  Marcin Kowalczyk
  \__/ [EMAIL PROTECTED]
   ^^http://qrnik.knm.org.pl/~qrczak/




Re: UTF-8 library

2002-08-10 Thread Ashley Yakeley

At 2002-08-10 01:21, Marcin 'Qrczak' Kowalczyk wrote:

>Perhaps we can assume some widely true facts, even if ANSI C doesn't
>guarantee them, if that makes life easier. For example, that a C type
>corresponding to Int32 exists at all, and that different pointer
>types have the same representation - we already rely on that, don't we?

No, we have separate Ptrs and FunctionPtrs IIRC...

One of the things that really bothers me about C is the way its 
unspecifiedness about types can "infect" other languages. For instance, 
what exactly is a Haskell Int?

Java, at least, stands firm, but then platform-independence was one of 
Java's explicit design priorities.

-- 
Ashley Yakeley, Seattle WA




Re: UTF-8 library

2002-08-10 Thread Marcin 'Qrczak' Kowalczyk

Thu, 8 Aug 2002 09:59:12 -0700 (PDT), anatoli <[EMAIL PROTECTED]> wrote:

> I'd still rather associate locale with a handle.

I agree. http://www.sf.net/projects/qforeign/ contains an experimental
character recoding library with an IO module wrapper which associates
encodings with Handles. But I don't like my own library (it seems ugly
and complex). It has rotted somewhat; I had unresolved packaging issues
with the newest ghcs.

-- 
  __("<  Marcin Kowalczyk
  \__/ [EMAIL PROTECTED]
   ^^http://qrnik.knm.org.pl/~qrczak/




Re: UTF-8 library

2002-08-10 Thread Marcin 'Qrczak' Kowalczyk

Thu, 08 Aug 2002 19:28:18 +1000 (EST), Manuel M T Chakravarty <[EMAIL PROTECTED]> wrote:

> ANSI C guarantees that char is 1 byte (more precisely that
> "sizeof (char)" == 1).

It says that sizeof (char) == 1 but doesn't say that this means 8 bits.
sizeof is measured in chars, whatever size a char is. But the limits for
values of char imply that it has at least 8 bits.

Perhaps we can assume some widely true facts, even if ANSI C doesn't
guarantee them, if that makes life easier. For example, that a C type
corresponding to Int32 exists at all, and that different pointer
types have the same representation - we already rely on that, don't we?
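
From the Haskell side, both quantities can be inspected through the FFI types. This is a sketch only; charSize, charBits and int32IsFourChars are illustrative names, and finiteBitSize comes from a Data.Bits newer than this thread.

```haskell
import Data.Bits (finiteBitSize)
import Data.Int (Int32)
import Foreign.C.Types (CChar)
import Foreign.Storable (sizeOf)

-- sizeof (char) as seen through the FFI: 1 by definition.
charSize :: Int
charSize = sizeOf (undefined :: CChar)

-- Number of bits in the Haskell type that mirrors C's char: at least 8.
charBits :: Int
charBits = finiteBitSize (0 :: CChar)

-- The "widely true fact": Int32 occupies exactly four 8-bit chars.
int32IsFourChars :: Bool
int32IsFourChars = charBits == 8 && sizeOf (undefined :: Int32) == 4
```

A library that relies on the assumption could evaluate int32IsFourChars once at start-up and refuse to run where it does not hold.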

-- 
  __("<  Marcin Kowalczyk
  \__/ [EMAIL PROTECTED]
   ^^http://qrnik.knm.org.pl/~qrczak/




Re: UTF-8 library

2002-08-09 Thread Sven Moritz Hallberg

On Thu, 2002-08-08 at 18:26, anatoli wrote:
> Having a locale associated with each individual stream is much more
> convenient.

I argue _strongly_ against associating some sort of locale state with
handles.

1) In agreement with Ashley's statements, file IO should use octets,
because that's what's in a file.

2) If you need to decode those octets to characters, or vice-versa,
compose a (de)serialization function before it.

3) A "best shot" character reading (or writing, for that matter)
function will be convenient. This should probably use your current
locale, because when writing a character, you'll probably want to be
able to write your own language's characters correctly.

4) For decoding, we'll need some parsing functionality, as someone
already mentioned. With that we can have functions like parseUTF8.
"Associating a locale with a stream", as you put it, is a matter of, if
f is the raw Word8 stream, g = parseUTF8 f, where g is the Char stream,
parsed as UTF-8-encoded characters from f.
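
Point 4 can be sketched as a pure function. The name parseUTF8 is taken from the text above; the body is only one possible implementation, and replacing malformed input with U+FFFD is an assumption of this sketch, not something the proposal specifies.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode a raw octet stream as UTF-8, lazily. A malformed lead octet is
-- replaced by U+FFFD and decoding resumes at the next octet.
parseUTF8 :: [Word8] -> [Char]
parseUTF8 [] = []
parseUTF8 (b : bs)
  | b < 0x80 = chr (fromIntegral b) : parseUTF8 bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise = '\xFFFD' : parseUTF8 bs
  where
    -- n continuation octets, payload bits of the lead octet, and the
    -- smallest code point allowed for this length (rejects overlong forms).
    multi :: Int -> Int -> Int -> [Char]
    multi n lead lo =
      let (cont, rest) = splitAt n bs
          code = foldl (\acc c -> (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)) lead cont
          valid =
            length cont == n
              && all (\c -> c .&. 0xC0 == 0x80) cont
              && code >= lo
              && code <= 0x10FFFF
              && not (code >= 0xD800 && code <= 0xDFFF) -- no surrogates
       in if valid then chr code : parseUTF8 rest else '\xFFFD' : parseUTF8 bs
```

With f as the raw Word8 stream, g = parseUTF8 f is then the Char stream, and a different encoding is just a different function in the same position.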


Sven Moritz




Re: UTF-8 library

2002-08-09 Thread Fergus Henderson

On 06-Aug-2002, George Russell <[EMAIL PROTECTED]> wrote:
> 
> Converting CStrings to [Word8] is probably a bad idea anyway, since there is
> absolutely no reason to assume a C character will be only 8 bits long, and
> under some implementations it isn't. 

That's true in general; the C standard only guarantees that a C character
will be at least 8 bits long.

But Posix now guarantees that C's `char' is exactly 8 bits.

Posix hasn't taken over the world yet, and doesn't look like doing so
in the near future.  So Haskell should not limit itself to being only
implementable on Posix systems.  However, systems which don't have 8-bit
bytes are getting very very rare nowadays -- it might well be reasonable
for Haskell, like Posix, to limit itself to only being implementable
on systems where C's `char' is exactly 8 bits.

-- 
Fergus Henderson <[EMAIL PROTECTED]>  |  "I have always known that the pursuit
The University of Melbourne |  of excellence is a lethal habit"
WWW:   | -- the last words of T. S. Garp.



Re: UTF-8 library

2002-08-09 Thread Ketil Z. Malde

anatoli <[EMAIL PROTECTED]> writes:

> Dependence on the current locale is EXTREMELY inconvenient.
> Imagine that you're writing a Web browser.

Web browsers get input with MIME declarations, and shouldn't rely on
*any* default setting.   Instead, they should read [Word8] and decode
the contents according to Content-Type/Content-Transfer-Encoding.
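
As a rough sketch of that dispatch, the decoder can be chosen from the declared charset label alone. Decoder and decoderFor are hypothetical names, only two charsets are covered, and extracting the label from the Content-Type header is left out.

```haskell
import Data.Char (chr, toLower)
import Data.Word (Word8)

type Decoder = [Word8] -> String

-- Choose a decoder from the charset label declared by the sender;
-- the process locale is never consulted.
decoderFor :: String -> Maybe Decoder
decoderFor label = case map toLower label of
  "us-ascii" -> Just (map asciiChar)
  "iso-8859-1" -> Just (map (chr . fromIntegral))
  _ -> Nothing -- unknown charset: the caller decides what to do
  where
    -- Octets outside the 7-bit range are not ASCII; mark them with U+FFFD.
    asciiChar b
      | b < 0x80 = chr (fromIntegral b)
      | otherwise = '\xFFFD'
```

For example, decoderFor "ISO-8859-1" yields a function that maps each octet to the code point of the same value, regardless of the user's locale.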

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants



Re: UTF-8 library

2002-08-08 Thread Joe English


anatoli wrote:

> I'd still rather associate locale with a handle. This way, all
> Char and String IO functions that exist, and those that are not
> written yet, can work with any encoding without relying on the
> abomination that is setlocale().

Seconded; this is the best approach.  The libc locale could
be consulted to determine the initial or default encoding,
or it could just be ignored (I'd vote to ignore it; setlocale() 
*is* an abomination.)

BTW, this is how Tcl does it -- each file handle has an associated
encoding (which may be changed on the fly) -- and it's very convenient.


--Joe English

  [EMAIL PROTECTED]