Re: simple binary IO proposition.

2000-09-02 Thread Marcin 'Qrczak' Kowalczyk

1 Sep 2000 14:00:44 GMT, Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED] writes:

> Haskell assumes that Chars have a fixed encoding (Unicode). Otherwise
> it would be quite impossible to have functions like isSpace or toUpper
> or words.

I have a working temporary hack: a module with a separate Handle
wrapper and IO functions doing conversion on the fly (a bit
inefficient because of poor buffering). My FFI library now converts
strings to/from the local multibyte encoding by default.

What should conversion error handling look like?

Here is the current state. Perhaps it should be designed better; if so, how?

A core conversion always outputs a converted string, recovering from
errors as best it can, plus an error flag. Errors may come from
invalid input or from characters not representable in the output,
but the two kinds are not formally distinguished (in a conversion
between Unicode and anything else, only one kind can occur anyway).

Text may be converted in arbitrary chunks, and the result of
converting each chunk carries its own error flag. There is a second
variant of direct chunk conversion that checks this flag and throws
an IO error when it is set.
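
In rough outline (the names here are only illustrative, not
necessarily the final ones):

    -- A stateful chunk converter: each chunk yields converted output,
    -- an error flag, and the converter to use for the next chunk, so
    -- conversion state is threaded implicitly.
    data Conversion i o = Conversion
      { convertChunk :: [i] -> ([o], Bool, Conversion i o)
      , flushConv    :: ([o], Bool)   -- flushing, described below
      }

    -- The throwing variant: check the flag, raise an IO error when set.
    convertChunkIO :: Conversion i o -> [i] -> IO ([o], Conversion i o)
    convertChunkIO conv xs =
      case convertChunk conv xs of
        (ys, False, conv') -> return (ys, conv')
        (_,  True,  _)     -> ioError (userError "conversion error")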

There is a concept of flushing the conversion, where the state
(in stateful encodings) gets reset and any incomplete multibyte
character at the end of the previous input is treated as an error;
normally multibyte characters split at chunk boundaries are handled
automatically. Flushing should be done after converting all chunks,
but can also be done at any time. There is a convenience function
that converts a whole string in one chunk and flushes, and a second
that throws an IO error instead of returning a flag.
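
The convenience functions, continuing the illustrative sketch:

    -- Convert a whole string in one chunk, then flush,
    -- combining the two error flags.
    convertAll :: Conversion i o -> [i] -> ([o], Bool)
    convertAll conv xs =
      let (ys, e1, conv') = convertChunk conv xs
          (zs, e2)        = flushConv conv'
      in  (ys ++ zs, e1 || e2)

    -- The same, but throwing an IO error instead of returning a flag.
    convertAllIO :: Conversion i o -> [i] -> IO [o]
    convertAllIO conv xs =
      case convertAll conv xs of
        (ys, False) -> return ys
        (_,  True ) -> ioError (userError "conversion error")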

IO functions, both input and output, usually throw the error. The
exception is hGetContents and friends, which ignore conversion errors
by using the flag variant and never looking at the flag (being lazy,
they cannot signal errors).
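
Concretely, such a lazy reader can use the flag variant and simply
never force the flag (continuing the sketch; one chunk shown):

    -- The flag is returned but never demanded, so only as much input
    -- is converted as the consumer of the string actually forces.
    lazyDecode :: Conversion i Char -> [i] -> String
    lazyDecode conv xs = cs
      where (cs, _flag, _conv') = convertChunk conv xs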

When a long chunk is converted, this is done lazily, even when the
core conversion is implemented in C. Since the error flag depends on
the whole data, examining the flag usually forces the conversion to
run to the end (unless an error was found early), so code that does
not check the flag can be lazier.

The current implementation of conversion from the local encoding to
Unicode inserts a U+FFFD REPLACEMENT CHARACTER on error and of course
sets the flag. So the effect of a lazy cat (getContents >>= putStr,
or interact id) on invalid input in most locales is that the bad
sequence reads as U+FFFD and an exception is thrown when it is
written, because most encodings cannot represent U+FFFD. Those that
can (e.g. UTF-8 in a newer glibc than I have installed) will output
U+FFFD. A strict cat (e.g. an hGetChar loop that catches EOFError)
will detect errors while reading.
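
For concreteness, the two cats look like this (assuming Handles that
convert on the fly as described; catchIOError and isEOFError are the
standard IO error utilities):

    import System.IO.Error (catchIOError, isEOFError)

    -- Lazy cat: invalid input flows through as U+FFFD; the error
    -- surfaces, if at all, only when the output side rejects it.
    lazyCat :: IO ()
    lazyCat = getContents >>= putStr

    -- Strict cat: a getChar loop that stops at EOF; a conversion
    -- error is raised at read time as an ordinary IO error.
    strictCat :: IO ()
    strictCat = loop `catchIOError` \e ->
                  if isEOFError e then return () else ioError e
      where
        loop = getChar >>= putChar >> loop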

I am trying to strike a balance between complete control over error
detection and a simplicity of concepts that fits into the standard
Haskell libraries.

I have not integrated binary file handling yet, but will try. Text
files should be handled magically, just setting appropriate flags on
a file instead of doing the conversion manually, with a fallback to a
manual implementation when not applied to Handles.

What my framework does not provide:

- The ability to test alternative futures of a conversion starting
  from a given state. The conversion state is implicit, unlike in C,
  so once a chunk is committed to conversion, its influence on the
  state cannot be undone.

- Detection of the exact place of an error, unless the programmer
  provides one-character chunks (which may degrade performance).

- Disallowing silly conversions, like attaching a conversion from
  ISO-8859-2 to Unicode on stdout.

- Seeking to positions that are meaningful after conversion. I don't
  alter positions, so they are physical, and seeking will mess up the
  conversion state in the case of a stateful encoding.

What cool features it does have:

- Conversions may be provided by any part of the program. They are
  not limited to a system-wide database as in C.

- Conversions may be composed, so e.g. two Haskell programs talking
  over a socket can easily compress the stream (once I wrap gzip/bzip2
  in the right interface); a sketch of composition follows this list.

- Whatever can be lazy is lazy: processing a large file or other
  string sequentially does not keep the whole thing in memory.

- On the one hand, data is buffered when possible (a conversion
  implemented in C is not called separately for each character); on
  the other hand, splitting into chunks allows saying "now convert as
  much as this part of the input yields", so putStrLn will display
  the whole line on the terminal immediately. Flushing is a stronger
  action which may influence the output; a gzipped socket must be
  flushed before we wait for the answer.
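
A sketch of composition, continuing the illustrative interface from
above:

    -- Compose two conversions end to end. Flushing the composite
    -- flushes the first stage, pushes its remaining output through
    -- the second stage, then flushes that one too.
    composeConv :: Conversion a b -> Conversion b c -> Conversion a c
    composeConv f g = Conversion
      { convertChunk = \xs ->
          let (ys, e1, f') = convertChunk f xs
              (zs, e2, g') = convertChunk g ys
          in  (zs, e1 || e2, composeConv f' g')
      , flushConv =
          let (ys, e1)     = flushConv f
              (zs, e2, g') = convertChunk g ys
              (ws, e3)     = flushConv g'
          in  (zs ++ ws, e1 || e2 || e3)
      }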

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTÊPCZA
QRCZAK





Re: simple binary IO proposition.

2000-09-02 Thread Joe English


Marcin 'Qrczak' Kowalczyk wrote:

> Joe English writes:
> > According to the ISO C standard, the meaning of wchar_t
> > is implementation-defined.
>
> I know. How does one convert between the default multibyte locale
> and Unicode on such systems?

As far as I can tell, there's no way to do so in Standard C
without investigating the details of each particular
implementation.  Even then it might not be possible --
I *still* can't figure out what encodings are supported
on IRIX.

It seems to me that the Standard C library routines
are only useful for programs that wish to remain completely
isolated from the details of localization.  If there's
any requirement at all for a specific encoding or character
set (such as UTF-8 or UTF-16), they seem to be pretty much
worthless, as too much information is hidden.


--Joe English

  [EMAIL PROTECTED]