Fri, 1 Sep 2000 00:07:22 -0700, John Meacham <[EMAIL PROTECTED]> writes:
> but the whole point of the module was to finally distinguish between
> Bytes and Chars and provide a base mechanism which allows machine- and
> compiler-independent IO with Haskell.
This is independent of the issue whether bytes will be called Bytes
or Chars.
> I mean, what is the point of a strong type system if we go out of our
> way to lose distinctions between values? A raw byte stream and a
> String are two very different entities and should not be equated.
The only sane reason is to be more type safe. I usually stand
on the side of greater type safety, but here I think it introduces
inconvenience. It's not a strong position. I considered using Word8
or a separate Byte type too.
> also, there needs to be a simple mechanism which gets rid of all that
> \r\n <-> \n conversion and any sort of processing on the data.
OK. Actually GHC provides this as an extension (IOExts.openFileEx,
IOExts.hSetBinaryMode).
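A minimal sketch of how that extension is used (assuming the IOExts
interface mentioned above; the exact signatures may differ between GHC
versions):

import IO (IOMode(ReadMode), openFile, hGetContents)
import IOExts (hSetBinaryMode)

-- Open a file and switch the Handle to binary mode, so that no
-- \r\n <-> \n translation (or any other processing) is applied and
-- each Char carries one raw byte.
readRaw :: FilePath -> IO String
readRaw path = do
    h <- openFile path ReadMode
    hSetBinaryMode h True
    hGetContents h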
> it is fine to have hGetContents which gets text out of a file as
> appropriate for your system, but rarely, in fact almost never,
> is that what I actually want to do with IO. You usually have a
> well-specified, exact format of data you wish to read or write
> which is independent of the system or architecture you happen to
> be running on,
OK.
> a web browser: sometimes you get JPEGs, sometimes you get ASCII,
> sometimes you get UTF-16 little-endian off of a socket. I cannot
> and should not expect every conversion to be built into the
> language. I just want a byte stream from the socket I can interpret
> as I see fit.
I don't want to be _forced_ to get only a byte stream and interpret it
myself. I want to be able to say "this socket works in UTF-8 with \r\n
line terminators; our Haskell program, as always, works in Unicode with
\n, so please convert accordingly". Another time I will say "please
don't apply any conversion to that socket, I will do it myself -
I just want a stream of values '\0'..'\xFF'". By default all files
will use the locale-specific multibyte encoding.
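A rough sketch of the kind of interface I have in mind (all names here
are purely illustrative, not an existing API):

import IO (Handle)

-- Hypothetical types naming the conversions a Handle may perform.
data Encoding   = LocaleEncoding | UTF8 | Latin1 | NoConversion
data LineEnding = LF | CRLF

-- Attach a conversion to an already opened Handle (file or socket).
hSetConversion :: Handle -> Encoding -> LineEnding -> IO ()
hSetConversion _ _ _ = error "illustrative sketch only"

-- "This socket works in UTF-8 with \r\n line terminators":
setupSocket :: Handle -> IO ()
setupSocket h = hSetConversion h UTF8 CRLF

-- "Please don't apply any conversion; give me raw '\0'..'\xFF' values":
setupRaw :: Handle -> IO ()
setupRaw h = hSetConversion h NoConversion LF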
There is a choice whether all functions will represent data in Chars,
or some in Chars and some in Bytes (and some will come in both variants).
This is not a choice of expressiveness but of how the interface and
implementation will look.
> most times you don't want programs that are magically localized
> to where they are run, you want programs that act the exact same
> way everywhere.
Haskell assumes that Chars have a fixed encoding (Unicode). Otherwise
it would be quite impossible to have functions like isSpace or toUpper
or words.
C assumes that programs work internally in the default locale-specific
encoding, unless they apply a conversion explicitly. isSpace etc.
are locale-specific; in Haskell they would have to be in the IO monad.
The difference between these approaches is important.
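To illustrate the Haskell side of that difference: because Char has a
single fixed interpretation, these functions can stay pure, with no
locale lookup and no IO monad involved.

import Char (toUpper)

-- Pure: the result depends only on the argument, never on a runtime locale.
shout :: String -> String
shout = map toUpper

-- 'words' splits on isSpace, whose meaning is likewise fixed by the language.
tokenCount :: String -> Int
tokenCount = length . words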
Text files found in real life are in many possible encodings, and
nothing is in "the" Unicode (because there are several encodings that
express Unicode in a stream of bytes). How could putChar '\x1234'
work "the same way everywhere"? What should it output?
The C way has its advantages and disadvantages. A disadvantage is that
it is not possible to portably embed non-ASCII texts in the program.
String constants must then be stored in external files, e.g. by using
gettext, which will do the translation. OK, that is necessary anyway to
have the program localized for human languages, but it should not be
necessary in all cases, especially just to be able to express more than
ASCII independently of the language[1]. Also, Haskell currently does
not have a gettext equivalent.
I am not saying that it should not be possible to read binary files,
but that binary files should be a special case of Handles where
there is no conversion taking place. Handles must be able to do a
conversion anyway, so why would you propose a separate primitive
Handle type only for binary I/O?
GHC lets the OS handle line terminators. I would be happy to be able
to put this issue together with the rest of the conversion, but it's
harder to do portably, especially if the OS cares to translate file
positions (with my approach file positions are always in
bytes). Are there systems where the difference between text and binary
I/O means more than the line terminator convention?
> CRLF conversion, UTF-8 encoding and whatnot can all be implemented
> trivially in Haskell itself on top of raw IO and specified by the
> standards independently.
I don't see how to do it efficiently without special support
in Handle internals. Interaction with buffering is quite complex,
and data should not be unnecessarily converted between Haskell and
C strings. Since a conversion implemented in C will be the default
for I/O, it should not be unnecessarily slow.
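For concreteness, the layering being proposed would look roughly like
this (a naive sketch, assuming a binary-mode read that delivers one byte
per Char); every character is inspected again in Haskell after the
Handle's own buffering, which is exactly the overhead I am worried about:

-- Collapse \r\n into \n on a stream of byte-sized Chars.
decodeCRLF :: String -> String
decodeCRLF ('\r':'\n':rest) = '\n' : decodeCRLF rest
decodeCRLF (c:rest)         = c    : decodeCRLF rest
decodeCRLF []               = []

-- Text input layered on top of raw I/O.
readTextFile :: FilePath -> IO String
readTextFile path = do
    bytes <- readRawFile path   -- assumed binary-mode read, one byte per Char
    return (decodeCRLF bytes)
  where
    readRawFile = readFile      -- stand-in for a real binary-mode read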
I imagine that on some systems it might make sense to use the OS's wide
character functions instead of doing the conversion ourselves for
the default encoding. In this case Handle I/O will not live on top
of a byte stream.
--------
[1] Example. In the curses interface I am writing, VT100 semigraphics
will be expressed in Unicode (behind descriptive names). The
original curses approach is to use macros that expand to
whatever chtype value encodes the given character on the current
terminal. That is not applicable to Haskell: it would require getting
these values from the IO monad! The C solution comes from times
when everybody was using 8-bit chars and there was no room for
other characters, so terminals have separate modes to access
these characters, and the curses interface partially reflects
this. With wide Chars we can do better and express logical
constants as physical constants. This is another example of using
a known internal encoding in Haskell instead of working directly
in an unspecified encoding as in C.
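For example (a small illustrative sketch; the names are mine, only the
code points are fixed by Unicode):

hLine, vLine, ulCorner :: Char
hLine    = '\x2500'  -- BOX DRAWINGS LIGHT HORIZONTAL
vLine    = '\x2502'  -- BOX DRAWINGS LIGHT VERTICAL
ulCorner = '\x250C'  -- BOX DRAWINGS LIGHT DOWN AND RIGHT

-- In C curses these would be ACS_HLINE etc., macros whose values are only
-- known at run time, i.e. they would have to live in the IO monad in Haskell.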
--
__("< Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
\__/
^^ SUBSTITUTE SIGNATURE
QRCZAK