Sun, 3 Sep 2000 11:58:40 -0700, John Meacham <[EMAIL PROTECTED]> wrote:
> * program determinism (soundness): the goal is to write one program,
> guaranteed to produce the exact same byte output for a given byte
> stream input regardless of compiler/platform it is run on. if
> this is impossible for a given platform, the compilation should
> die with an error,
With my proposal such programs are possible but not enforced. One can
set every Handle one uses to binary mode with no conversion.
I doubt that there is a good way of determining whether the OS really
does what it was asked to do, so there is no place for compilation errors.
I gave up trying to fit text/binary translation into the conversion
framework and instead made both attributes settable in the same places.
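As a rough illustration of "binary + no conversion" on every Handle, here is a minimal sketch using today's GHC System.IO rather than the API proposed in this message. The helper names binaryHandle and copyBytes are hypothetical. In GHC, hSetBinaryMode h True turns off both newline translation and character encoding in one call.

```haskell
import System.IO

-- Hypothetical helper: open a file and switch the Handle to binary mode,
-- which in GHC disables newline translation and character conversion.
binaryHandle :: FilePath -> IOMode -> IO Handle
binaryHandle path mode = do
  h <- openFile path mode
  hSetBinaryMode h True
  return h

-- Hypothetical helper: copy a file byte for byte. Each byte travels as a
-- Char in the range '\0'..'\xFF'.
copyBytes :: FilePath -> FilePath -> IO ()
copyBytes src dst = do
  hin  <- binaryHandle src ReadMode
  hout <- binaryHandle dst WriteMode
  hGetContents hin >>= hPutStr hout
  hClose hout
  hClose hin
```

Nothing here checks that the OS honoured the request, which matches the point above: the program can ask for binary handles, but the compiler cannot verify the result.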
> * IO to/from externally defined byte formats: XDR encoded files,
> RIFF files, network protocols, odd character encodings, people
> need to be able to read and write these in a standard way to files
> as well as things like sockets and pipes.
OK. It will only be a bit inefficient for multibyte binary numbers
unless the API exposes internal structures that are quite incompatible
with Haskell98 (pointers into C arrays).
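To make the cost concrete, here is a minimal sketch of reading and writing a 32-bit big-endian number (the byte order XDR uses) over plain lists of bytes. The function names are hypothetical, and Word8/Word32 are assumed to come from Data.Word as in GHC. Each number is assembled one list cell at a time, which is the inefficiency meant above.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word8, Word32)

-- Hypothetical helper: decode four bytes, most significant first.
-- Returns the value and the remaining input, or Nothing on short input.
decodeWord32BE :: [Word8] -> Maybe (Word32, [Word8])
decodeWord32BE (a:b:c:d:rest) = Just (foldl step 0 [a, b, c, d], rest)
  where
    step :: Word32 -> Word8 -> Word32
    step acc x = (acc `shiftL` 8) .|. fromIntegral x
decodeWord32BE _ = Nothing

-- Hypothetical helper: encode a 32-bit value as four bytes,
-- most significant first. fromIntegral keeps only the low 8 bits.
encodeWord32BE :: Word32 -> [Word8]
encodeWord32BE w = [fromIntegral (w `shiftR` s) | s <- [24, 16, 8, 0]]
```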
> * there is no Byte type for people writing new libraries, each
> person who writes a library which works on Byte streams must come
> up with their own kludge,
In GHC there is Word8, but to use it with files one has to translate
data by (fromIntegral . ord). Raw I/O currently produces data as Chars
'\0'..'\xFF'.
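The translation mentioned above can be sketched as follows. The helper names are hypothetical; Word8 is assumed to be imported from Data.Word. Note that fromIntegral silently wraps modulo 256, so a Char above '\xFF' would be mangled rather than rejected.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical helper: view a Char produced by raw I/O as a byte.
-- Only meaningful for Chars in '\0'..'\xFF'; larger values wrap.
charToByte :: Char -> Word8
charToByte = fromIntegral . ord

-- Hypothetical helper: the inverse direction, for writing bytes out
-- through a Char-oriented Handle.
byteToChar :: Word8 -> Char
byteToChar = chr . fromIntegral

stringToBytes :: String -> [Word8]
stringToBytes = map charToByte

bytesToString :: [Word8] -> String
bytesToString = map byteToChar
```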
> a common solution is to assume a Char is a byte, which is
> not true, the language specifies that a Char is a unicode
> encoded character, which arbitrary binary data is not, this
> leads to confusion and programming errors because the type
> information is lost, there is nothing to distinguish a raw
> UTF8 encoded byte stream from an actual Haskell string,
Right. But UTF-8 data are not supposed to be used by programs directly.
When they operate on texts, they do it in the native Haskell format:
Unicode wide characters. The text was already converted on input and
will be converted back on output.
"Incorrect" UTF-8 in Chars is seen only by the conversion engine and
sometimes by low-level I/O code.
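For illustration, the output half of such a conversion engine might look like the sketch below: the program works on Unicode Chars, and UTF-8 bytes appear only at the boundary. This is a hand-written encoder with hypothetical names, not the conversion code discussed here, and it does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: encode one Unicode character as 1 to 4 UTF-8 bytes.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 ch
  | c < 0x80    = [byte c]
  | c < 0x800   = [byte (0xC0 .|. (c `shiftR` 6)), cont 0]
  | c < 0x10000 = [byte (0xE0 .|. (c `shiftR` 12)), cont 6, cont 0]
  | otherwise   = [byte (0xF0 .|. (c `shiftR` 18)), cont 12, cont 6, cont 0]
  where
    c = ord ch
    byte :: Int -> Word8
    byte = fromIntegral
    -- Continuation byte: 10xxxxxx carrying six bits taken at offset s.
    cont s = byte (0x80 .|. ((c `shiftR` s) .&. 0x3F))

-- Encode a whole Haskell string on its way out of the program.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeCharUtf8
```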
OTOH having Bytes separate from Chars would lead to duplication of
interfaces (and implementation) and additional parametrization of
conversion functions by types of input and output characters.
I will check what it would look like. Maybe it's not that bad...
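One possible shape of that extra parametrization, purely as a sketch: a conversion type indexed by its input and output element types. Conv, latin1Decode, latin1Encode and compose are hypothetical names chosen for illustration, not part of any existing or proposed library.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical type: a conversion from a stream of i to a stream of o.
newtype Conv i o = Conv { runConv :: [i] -> [o] }

-- Bytes to characters: every byte maps to the code point of equal value.
latin1Decode :: Conv Word8 Char
latin1Decode = Conv (map (chr . fromIntegral))

-- Characters to bytes: anything outside Latin-1 becomes '?' (byte 63).
latin1Encode :: Conv Char Word8
latin1Encode = Conv (map enc)
  where
    enc ch
      | ord ch < 256 = fromIntegral (ord ch)
      | otherwise    = 63

-- Conversions chain only when the intermediate element types agree.
compose :: Conv a b -> Conv b c -> Conv a c
compose (Conv f) (Conv g) = Conv (g . f)
```

With a single Char type the two type parameters disappear, which is the simplification that keeping Bytes inside Chars buys.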
Your view is focused on binary files: files consist of bytes which may
be interpreted as text if needed. My view is focused on text files:
files consist of characters (which may be physically represented in
various ways) which may be interpreted as binary data if needed.
Anyway, the binary I/O API would be an additional API - not an
underlying API. Efficient handling of conversions implemented in C
requires integrating conversions with I/O at quite a low level.
> On many platforms Chars will be 32 bits, this results
> in a 4 fold increase in space needed for evaluated byte streams.
Haskell lists already have an overhead. In GHC a list of bytes takes
exactly the same amount of memory as a list of Chars.
Moreover, Word8 in GHC and Hugs is currently implemented as a 32/64
bit word. It does not matter for standalone objects as long as they
behave correctly.
> often you use Byte streams in areas which have nothing at all to
> do with character or string encoding, the use of String and [Char]
> in those cases would be confusing to new and experienced users.
Right.
> if a database API has a function lookup :: [Char] -> [Char]
> can it only work on strings? or arbitrary byte streams? what
> is the character encoding used in the database if i want to
> access it from C?
The encoding is not known for [Byte] -> [Byte] either.
If the data is textual, it should be exposed in Unicode strings
so clients don't translate it themselves. If the data is binary,
[Byte] -> [Byte] can be used. The database will internally cast it
for the Char-oriented I/O API if needed.
--
__("< Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
\__/
^^ SUBSTITUTE SIGNATURE
QRCZAK