1 Sep 2000 14:00:44 GMT, Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:
Haskell assumes that Chars have a fixed encoding (Unicode). Otherwise
it would be quite impossible to have functions like isSpace or toUpper
or words.
I have a working temporary hack: a module with a separate Handle wrapper
and IO functions doing conversion on the fly (a bit inefficient because
of poor buffering). My FFI library now converts strings to/from the
local multibyte encoding by default.
What should conversion error handling look like?
Here is the current state. Perhaps it should be designed better - how?
A core conversion always outputs a converted string, recovering from
errors as best it can, plus an error flag. Errors may come from invalid
input or from characters not representable in the output, but the two
kinds are not formally distinguished (in a conversion between Unicode
and another encoding, only one kind of error can occur).
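As a sketch of the shape such a core conversion might take (the types and names here are mine, not the library's), a stateless conversion can be modeled as a function returning the converted text together with the error flag:

```haskell
-- Hypothetical sketch of a core conversion: output plus an error flag.
-- Invalid input and unrepresentable characters both merely set the flag.
type ErrorFlag = Bool

-- Example: converting Unicode to ASCII. Characters the target cannot
-- represent are replaced and the error flag is set.
toAscii :: String -> (String, ErrorFlag)
toAscii s = (map replace s, any (> '\x7F') s)
  where
    replace c
      | c > '\x7F' = '?'   -- not representable in the output encoding
      | otherwise  = c
```

A real conversion for a stateful or multibyte encoding would additionally thread state between chunks.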
Text may be converted in arbitrary chunks. The result of converting
each chunk carries its own error flag. There is a second variant of
direct chunk conversion that checks this flag and throws an IO error
when appropriate.
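The two variants might look like this (a sketch with made-up names; `convertChunk` stands in for an actual core conversion):

```haskell
-- Hypothetical sketch of the two chunk-conversion variants. The flag
-- variant returns the error flag alongside each converted chunk; the
-- throwing variant checks the flag and raises an IO error when it is set.

-- A stand-in core conversion: Unicode to Latin-1, replacing and
-- flagging characters the target cannot represent.
convertChunk :: String -> (String, Bool)
convertChunk s = (map clip s, any (> '\xFF') s)
  where clip c = if c > '\xFF' then '?' else c

-- Throwing variant, built on the flag variant.
convertChunkIO :: String -> IO String
convertChunkIO s =
  let (out, err) = convertChunk s
  in if err
       then ioError (userError "character conversion error")
       else return out
```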
There is a concept of flushing the conversion, where the state
(in stateful encodings) gets reset and any incomplete multibyte
character at the end of the previous input is treated as an error;
normally multibyte characters split at chunk boundaries are handled
automatically. Flushing should be done after converting all chunks,
but can also be done at any time. There is a convenience function
that converts a whole string in one chunk and flushes, and a second
that throws an IO error instead of returning a flag.
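To illustrate the chunk/flush interface, here is a toy stateful decoder for a made-up "doubled" encoding (every character is written twice); the names and the encoding are mine, purely for illustration:

```haskell
-- State carried between chunks: the pending half of an incomplete
-- "doubled" character split at a chunk boundary.
type ConvState = String

decodeChunk :: ConvState -> String -> (ConvState, String, Bool)
decodeChunk st chunk = go (st ++ chunk)
  where
    go (a:b:rest)
      | a == b    = let (st', out, err) = go rest     in (st', a : out, err)
      | otherwise = let (st', out, err) = go (b:rest) in (st', out, True)
    go [a] = ([a], "", False)   -- incomplete pair: carry over as state
    go []  = ("", "", False)

-- Flushing: a pending incomplete character is an error; state resets.
flushConv :: ConvState -> (ConvState, Bool)
flushConv st = ("", not (null st))

-- Convenience: convert a whole string in one chunk and flush.
decodeAll :: String -> (String, Bool)
decodeAll s =
  let (st, out, err) = decodeChunk "" s
      (_, err')      = flushConv st
  in (out, err || err')
```

Note how a pair split across two chunks is handled automatically: `decodeChunk "" "aab"` leaves `"b"` pending, and feeding the next chunk completes it.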
IO functions, both input and output, usually throw the error. The
exception is hGetContents and friends, which ignore conversion errors
by using the flag variant and never looking at the flag (being lazy,
they cannot signal errors).
When a long chunk is converted, the conversion is performed lazily,
even when the core conversion is implemented in C. Since the error flag
depends on all the data, inspecting the flag usually forces the
conversion to run to its end (unless an error was found early). Code
that does not check the flag can therefore be lazier; it should be
obvious when this applies.
The current implementation of conversion from the local encoding to
Unicode inserts a U+FFFD REPLACEMENT CHARACTER on error and of course
sets the flag. So in most locales the effect of a lazy cat
(getContents >>= putStr, or interact id) on invalid input is that the
bad byte is read as U+FFFD and an exception is thrown when it is about
to be written, because most encodings cannot represent U+FFFD. Those
that can (e.g. UTF-8 in a newer glibc that I don't have installed)
will output U+FFFD.
A strict cat (e.g. a hGetChar loop that catches EOFError) will detect
errors while reading.
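A toy model of this behaviour (names are mine, not the library's):

```haskell
-- Decoding from the local encoding replaces invalid input with U+FFFD
-- and sets the error flag; encoding back fails on U+FFFD because most
-- 8-bit encodings cannot represent it.

decodeLocal :: String -> (String, Bool)
decodeLocal s = (map fix s, any invalid s)
  where
    invalid c = c > '\xFF'       -- stand-in for "invalid byte sequence"
    fix c     = if invalid c then '\xFFFD' else c

encodeLocal :: Char -> Either String Char
encodeLocal c
  | c <= '\xFF' = Right c
  | otherwise   = Left ("cannot encode " ++ show c)

-- A lazy cat is then simply:
--   main = interact id
-- It reads the bad byte as U+FFFD and only fails when output tries to
-- encode it; a strict hGetChar loop fails at read time instead.
```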
I am trying to find a balance between having complete control over
the error detection and simplicity of concepts that also fits into
standard Haskell libraries.
I have not yet integrated binary file handling, but will try. Text
files should be handled magically, by just setting appropriate flags
on a file instead of doing the conversion manually, but this will have
to fall back to a manual implementation when not applied to Handles.
What my framework does not provide:
- The ability to test alternative futures of a conversion starting
from a given state. The conversion state is implicit, unlike in C,
so once a chunk is committed to conversion, its influence on the
state cannot be undone.
- Detection of exact place of an error, unless the programmer provides
one character chunks (which may degrade performance).
- Disallowing silly conversions, like attaching a conversion from
ISO-8859-2 to Unicode on stdout.
- Seeking to positions that are meaningful after conversion. I don't
alter positions, so they are physical and will mess up the conversion
state in the case of a stateful encoding.
What cool features it does have:
- Conversions may be provided by any part of the program. They are
not limited to a system-wide database as in C.
- Conversions may be composed. So e.g. two Haskell programs talking
over a socket can easily compress the stream (when I wrap gzip/bzip2
in the right interface).
- What can be lazy is lazy. Processing a large file or other string
sequentially does not keep the whole thing in memory.
- On one hand, data is buffered when possible (a conversion implemented
in C is not called separately for each character); on the other hand,
splitting into chunks makes it possible to say "now convert as much
as this part of the input yields", so putStrLn will display the whole
line on the terminal immediately. Flushing is a stronger action which
may influence the output; a gzipped socket must be flushed before
we wait for the answer.
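Composition of conversions could be sketched like this (types and names are made up; a real interface would also thread conversion state):

```haskell
import Data.Char (toUpper)

-- A conversion maps an input chunk to an output chunk plus an error flag.
type Conv a b = [a] -> ([b], Bool)

-- Composing two conversions: feed one's output into the other and
-- combine the error flags.
compose :: Conv a b -> Conv b c -> Conv a c
compose f g xs =
  let (ys, e1) = f xs
      (zs, e2) = g ys
  in (zs, e1 || e2)

-- Example conversions, illustrative only:
upper :: Conv Char Char
upper s = (map toUpper s, False)

-- Strip non-ASCII, flagging an error when anything was dropped.
toAsciiConv :: Conv Char Char
toAsciiConv s = (filter (<= '\x7F') s, any (> '\x7F') s)
```

A compressing conversion wrapping gzip/bzip2 would slot into the same `Conv`-like interface, which is what makes the compressed-socket example possible.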
--
__(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
\__/
^^ SYGNATURA ZASTĘPCZA
QRCZAK