Re: simple binary IO proposition.

Marcin 'Qrczak' Kowalczyk Mon, 04 Sep 2000 03:46:02 -0700
Sun, 3 Sep 2000 11:58:40 -0700, John Meacham <[EMAIL PROTECTED]> writes:

>      * program determinism (soundness): the goal is to write one program,
>        guaranteed to produce the exact same byte output for a given byte
>        stream input regardless of compiler/platform it is run on. if
>        this is impossible for a given platform, the compilation should
>        die with an error,

With my proposal such programs are possible but not enforced: one can
set every Handle one uses to binary mode with no conversion.

I doubt that there is a good way of determining whether the OS really
does what it was asked to do, so there is no place for compilation errors.

I gave up trying to fit text/binary translation into the conversion
framework and instead arranged for both attributes to be set in the
same places.
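
A minimal sketch of what a "binary + no conversion" Handle looks like
in practice. It uses withBinaryFile from GHC's System.IO as a stand-in,
which is an assumption about the available library and not the API
proposed in this thread; writeRaw and readRaw are hypothetical helper
names.

```haskell
import System.IO
import Data.Char (chr, ord)

-- Hypothetical helper: write byte values 0..255 through a Handle that
-- performs no newline translation and no character conversion.
writeRaw :: FilePath -> [Int] -> IO ()
writeRaw path bytes =
  withBinaryFile path WriteMode $ \h -> hPutStr h (map chr bytes)

-- Hypothetical helper: read the file back as byte values.
readRaw :: FilePath -> IO [Int]
readRaw path =
  withBinaryFile path ReadMode $ \h -> do
    s <- hGetContents h
    -- hGetContents is lazy: force the whole string before the Handle
    -- is closed by withBinaryFile.
    length s `seq` return (map ord s)
```

Bytes such as 13, 10 and 26 are the interesting cases, since a
text-mode Handle on some platforms would alter them.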

>      * IO to/from externally defined byte formats: XDR encoded files,
>        RIFF files, network protocols, odd character encodings, people
>        need to be able to read and write these in a standard way to files
>        as well as things like sockets and pipes.

OK. It will only be a bit inefficient for multibyte binary numbers
unless the API exposes internal structures that are quite incompatible
with Haskell98 (pointers into C arrays).
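
To illustrate the byte-at-a-time handling of multibyte numbers that I
mean, here is a sketch in plain Haskell, assuming GHC's Data.Word and
Data.Bits. The names fromBE32 and toBE32 are made up for the example.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word8, Word32)

-- Assemble a big-endian (XDR-style) 32-bit number from the first four
-- bytes of a stream, returning the number and the remaining bytes.
fromBE32 :: [Word8] -> Maybe (Word32, [Word8])
fromBE32 (a:b:c:d:rest) = Just (foldl step 0 [a, b, c, d], rest)
  where
    step acc x = (acc `shiftL` 8) .|. fromIntegral x
fromBE32 _ = Nothing  -- fewer than four bytes available

-- Split a 32-bit number into four bytes, most significant first.
-- fromIntegral keeps only the low 8 bits of each shifted value.
toBE32 :: Word32 -> [Word8]
toBE32 w = [fromIntegral (w `shiftR` s) | s <- [24, 16, 8, 0]]
```

Every number costs several list cells and shifts, which is the
inefficiency in question; avoiding it needs direct access to the
underlying buffer.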

>      * there is no Byte type for people writing new libraries, each
>        person who writes a library which works on Byte streams must come
>        up with their own kludge,

In GHC there is Word8, but to use it with files one has to translate
data by (fromIntegral . ord). Raw I/O currently produces data as Chars
'\0'..'\xFF'.
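
The translation mentioned above is short enough to spell out. This is
only a sketch, assuming GHC's Data.Word; charToByte and byteToChar are
illustrative names, not library functions.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Raw I/O yields Chars in '\0'..'\xFF'; turn one into a Word8.
-- A Char above '\xFF' would be silently reduced modulo 256.
charToByte :: Char -> Word8
charToByte = fromIntegral . ord

-- The reverse direction, for writing bytes through a Char-based Handle.
byteToChar :: Word8 -> Char
byteToChar = chr . fromIntegral
```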

>        a common solution is to assume a Char is a byte, which is
>        not true, the language specifies that a Char is a unicode
>        encoded character, which arbitrary binary data is not, this
>        leads to confusion and programming errors because the type
>        information is lost, there is nothing to distinguish a raw
>        UTF8 encoded byte stream from an actual Haskell string,

Right. But UTF-8 data are not supposed to be used by programs directly.
When they operate on texts, they do it in the native Haskell format:
Unicode wide characters. The text was already converted on input and
will be converted back on output.

"Incorrect" UTF-8 in Chars is seen only by the conversion engine and
sometimes by low-level I/O code.
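
For concreteness, the output half of such a conversion could look
roughly like this. It is a sketch of standard UTF-8 encoding, not the
conversion engine from my proposal, and encodeChar is a hypothetical
name. It does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Unicode character as its UTF-8 byte sequence.
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. lead 6, cont 0]
  | n < 0x10000 = [0xE0 .|. lead 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload bits of the leading byte.
    lead :: Int -> Word8
    lead s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits of the code point.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- The program itself sees only the String; the bytes exist only on
-- the I/O side of the conversion.
encodeString :: String -> [Word8]
encodeString = concatMap encodeChar
```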

OTOH having Bytes separate from Chars would lead to duplication of
interfaces (and implementation) and additional parametrization of
conversion functions by types of input and output characters.

I will check what it would look like. Maybe it's not that bad...

Your view is focused on binary files: files consist of bytes which may
be interpreted as text if needed. My view is focused on text files:
files consist of characters (which may be physically represented in
various ways) which may be interpreted as binary data if needed.

Anyway, the binary I/O API would be an additional API - not an
underlying API. Efficient handling of conversions implemented in C
requires integrating conversions with I/O on a quite low level.

>        On many platforms Chars will be 32 bits, this results
>        in a 4 fold increase in space needed for evaluated byte streams.

Haskell lists already have an overhead. In GHC a list of bytes takes
exactly the same amount of memory as a list of Chars.

Moreover, Word8 in GHC and Hugs is currently implemented as a 32/64
bit word. It does not matter for standalone objects as long as they
behave correctly.

>        often you use Byte streams in areas which have nothing at all to
>        do with character or string encoding, the use of String and [Char]
>        in those cases would be confusing to new and experienced users.

Right.

>        if a database API has a function lookup :: [Char] -> [Char]
>        can it only work on strings? or arbitrary byte streams? what
>        is the character encoding used in the database if i want to
>        access it from C?

The encoding is not known for [Byte] -> [Byte] either.

If the data is textual, it should be exposed as Unicode strings
so that clients don't have to translate it themselves. If the data is
binary, [Byte] -> [Byte] can be used. The database will internally cast
it for the Char-oriented I/O API if needed.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      SYGNATURA ZASTĘPCZA
QRCZAK