Thu, 31 Aug 2000 15:19:03 -0700, John Meacham <[EMAIL PROTECTED]> writes:

> the definition of Byte as a type will help out incredibly, i have
> seen too many modules that define type Byte = Char, even the Posix
> stuff in hslibs does it; the existence of a 'proper' definition of
> Byte will at least encourage people to write more standards-compliant
> code..

A separate Byte type is not necessary to do everything with binary
files. I was considering that too, but it may be simpler to do
everything around Chars (in the external interface).

I am working on a framework of text conversions, which will also be
used transparently by input/output. Current assumptions:

There is a concept of an ongoing conversion between two streams of
characters. You feed it with a block of characters and get back the
converted part. The conversion automatically maintains the state of
stateful input or output encodings, and the public interface takes
care of partial multibyte sequences at the end of a block of input.
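The idea can be sketched in Haskell roughly as follows. This is only
an illustration under my own naming, not the actual interface:
feeding a block yields the converted part together with a
continuation that carries the encoder state and any pending partial
multibyte sequence.

```haskell
-- An ongoing conversion: feed a block, get output and a continuation.
newtype Conversion = Conversion { feed :: String -> (String, Conversion) }

-- Drive a conversion over a list of input blocks, concatenating output.
runBlocks :: Conversion -> [String] -> String
runBlocks _ []     = ""
runBlocks c (b:bs) = let (out, c') = feed c b
                     in out ++ runBlocks c' bs

-- A stateless example conversion standing in for a real recoder:
-- upcase ASCII letters, leave everything else alone.
upcaseAscii :: Conversion
upcaseAscii = Conversion (\s -> (map up s, upcaseAscii))
  where up ch | ch >= 'a' && ch <= 'z' = toEnum (fromEnum ch - 32)
              | otherwise              = ch
```

A stateful encoding would return a different continuation instead of
`upcaseAscii` itself.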

I hope that in the future every Handle will have a conversion attached
(or two for ReadWrite Handles). That way files will be automatically
recoded in a chosen way.

Some specific conversions (more precisely, ways to produce new
conversions; a "conversion" means one that is already running):

- Default local input, from whatever the current locale specifies
  to Unicode. Assumes that input was encoded using characters '\0'..'\xFF'.

- Default local output, the opposite.
  
  These two will be used by the FFI's default conversions between
  Strings and C strings. When something else is needed, the programmer
  will convert [CChar] in whatever way he needs and use the FFI to
  convert that to a C char array.

- Identity. Used for binary files: they will appear as streams of
  characters '\0'..'\xFF'.

- Most important explicit encodings (ISO-8859-x, UTF-8, more if needed),
  in both directions.

- Any conversion that iconv (Unix98 / Single Unix Spec.) provides.
  It must be told whether iconv treats the buffers as streams of bytes
  or of wider entities - that depends on the encodings.

- Explicit handling of line terminators. For example Internet protocols
  often use "\r\n" - let it be handled by the socket. I'm not sure how
  to integrate this aspect for local files: it will be more portably
  and efficiently handled by native low-level IO functions.

- Gzip and Bzip2 compression and decompression. Want to read *.gz
  files decompressed on the fly? Compression modules will provide
  conversions for that! They work on '\0'..'\xFF' on both ends.

- Composition of two conversions. For example, after decompression the
  data should be recoded from some encoding to Unicode.

- An adapter that takes a conversion from Unicode to something,
  plus a table telling which Unicode characters may substitute for
  which, and produces a conversion that uses the right substitutions
  for characters that the original was not able to express.
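The substitution adapter from the last point can be sketched like
this. All names here are mine, purely illustrative; an encoder that
may fail is modelled as returning Nothing per character:

```haskell
-- An encoder that may be unable to express some characters.
type Encoder = Char -> Maybe String   -- Nothing: not expressible

-- Wrap an encoder with a substitution table to make it total.
withSubstitutes :: [(Char, Char)] -> Encoder -> (Char -> String)
withSubstitutes table enc c =
  case enc c of
    Just s  -> s
    Nothing -> case lookup c table of
                 Just c' -> maybe "?" id (enc c')
                 Nothing -> "?"       -- last-resort replacement

-- Example: an ASCII-only encoder plus a table of ASCII lookalikes.
asciiOnly :: Encoder
asciiOnly c | fromEnum c < 128 = Just [c]
            | otherwise        = Nothing

lookalikes :: [(Char, Char)]
lookalikes = [('\215', 'x')]   -- multiplication sign -> letter x
```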

It appears that some conversions work on bytes and some work on
large Unicode subsets. I see two choices: either make this explicit
in the types and parametrize everything by the input and output
character types (only two choices in use), or use Chars for both.
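The "explicit in types" alternative would look roughly like this
(names illustrative):

```haskell
-- A byte is a character restricted to '\0'..'\xFF' in this sketch.
newtype Byte = Byte Char

-- A conversion parametrized by input and output element types.
newtype Conv i o = Conv { runConv :: [i] -> [o] }

-- ISO-8859-1 decoding is then just an injection of bytes into Chars:
decodeLatin1 :: Conv Byte Char
decodeLatin1 = Conv (map (\(Byte c) -> c))

-- A compression conversion would be Conv Byte Byte, and the default
-- local output conversion Conv Char Byte.
```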

Handles currently use C character buffers. Writing begins to use
them even _before_ the stream should be converted: concurrency and
laziness require evaluating a whole block of characters before it is
serialized with concurrent writes; it is then stored in a character
buffer, and the conversion must be done after serialization.

It would be silly to convert the first part of a String to a C
array, convert it back to a String to pass it to a generic conversion
function, convert it back to a C array in case it will be the default
which uses wcrtomb[1], convert the result from a C array to a String
to return it from a generic conversion function, and convert it back
to a C array to physically write it to a file.

So although the public interface will use Strings, the implementation
should provide hooks to be able to link a conversion implemented in
C into a Handle in a way that does not unnecessarily convert data
between Haskell and C.

There are choices of balance between eliminating all overhead and
simplicity. I've just abandoned the idea that a conversion has two
variants of input (coming from a String or from a C array) independent
of two analogous variants of output. Mixed variants would arise, for
example, from composing mixed pure Haskell + pure C conversions. But
that makes 16 cases for implementing the composition (each of the two
composed conversions comes in 2*2 input/output flavors, giving 4*4
combinations)! This does not even include variants for character
sizes, which are specified explicitly for each C input and C output.
I found no excuse for disallowing strange compositions, so all of them
would have to be implemented.

No - composition will not be used enough to justify this complexity.
Now I am experimenting with only two variants of a conversion: pure C
and pure Haskell. Composing any combination other than two C
conversions will produce a conversion with a Haskell interface,
i.e. one working on Strings.
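The two-variant scheme can be sketched like this (illustrative
names; the C variant is a pure stand-in here, with marshalling
omitted):

```haskell
-- Stand-in for a conversion implemented in C over raw buffers.
data CImpl = CImpl (String -> String)

-- A conversion is implemented either in Haskell or in C.
data AnyConv
  = HaskellConv (String -> String)
  | CConv CImpl

-- View any conversion through the Haskell (String) interface.
asHaskell :: AnyConv -> String -> String
asHaskell (HaskellConv f)   = f
asHaskell (CConv (CImpl f)) = f   -- real code would marshal buffers

-- Compose: only C . C stays on the C side; everything else falls
-- back to the Haskell interface.
compose :: AnyConv -> AnyConv -> AnyConv
compose (CConv (CImpl f)) (CConv (CImpl g)) = CConv (CImpl (g . f))
compose a b = HaskellConv (asHaskell b . asHaskell a)
```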

Note that working with C conversions is generally painful, because
they require a caller-supplied buffer for output and finish when
either all input is converted (except perhaps a partial multibyte
sequence at the end) or the output buffer is full.
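The resulting buffer discipline looks roughly like this. `step` is a
pure stand-in of mine for a foreign call such as iconv(3): it
consumes input and fills a fixed-size output buffer, and the driver
loops, concatenating the chunks, until the input is exhausted.

```haskell
bufSize :: Int
bufSize = 4

-- step input room = (output produced, input left unconsumed);
-- a real C conversion would also recode while copying.
step :: String -> Int -> (String, String)
step inp room = (take room inp, drop room inp)

-- The driver loop: call step repeatedly with a fresh output buffer.
drain :: String -> String
drain []  = []
drain inp = let (out, rest) = step inp bufSize
            in out ++ drain rest
```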

I also abandoned the idea of providing a single conversion in multiple
flavors, C or Haskell, where e.g. composition would choose the best
combinations to provide all flavors that can be done efficiently at
all. Too complex. But this means that there will be a third variant,
Identity: this special case works well in both the Haskell and C
flavors without translation.

Conversion of large Haskell strings should be lazy. There should also
be a way to tell whether there were any conversion errors (some output
will always be produced, even when there were errors). These two
requirements are not quite compatible, but I've managed to keep
laziness in the case where the programmer does not look at the error
flag - without separate "lazy without errors" and "non-lazy with
errors" variants. Laziness for conversions implemented in C is
obtained by unsafeInterleaveIO.
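A sketch of the "lazy unless you ask about errors" trick, under my
own illustrative names: output is produced lazily with
unsafeInterleaveIO, and an IORef records whether anything failed to
convert. Reading the flag is meaningful only after the consumer has
forced the output it cares about.

```haskell
import Data.IORef
import System.IO.Unsafe (unsafeInterleaveIO)

-- Convert lazily; the IORef is set to True on the first error.
convertLazily :: (Char -> Maybe Char) -> String -> IO (String, IORef Bool)
convertLazily conv input = do
  errRef <- newIORef False
  let go []     = return []
      go (c:cs) = unsafeInterleaveIO $ do
        rest <- go cs
        case conv c of
          Just c' -> return (c' : rest)
          Nothing -> do writeIORef errRef True
                        return ('?' : rest)   -- replacement character
  out <- go input
  return (out, errRef)
```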

The conversion interface lives in the IO monad - although in principle
this should not always be necessary, there are several reasons to use IO.

I told the story about conversions implemented in C to show that a
conversion type parametrized by the input and output character types -
to make it explicit where we work on Chars and where on Bytes - is not
that easy. It's possible, but it would require introducing some
classes, because handling the various character types is not uniform
where they must be understood by C. The picture is already complex
enough without this.

Instead of providing separate binary mirrors of IO primitives, I think
it's simpler to pretend that everything uses Chars, and use the same
functions for streams with and without conversion. ASCII characters
may be used in byte streams without (fromIntegral . ord) conversion.
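Under that convention, binary data is just a String over
'\0'..'\xFF', and ASCII bytes can be written as plain character
literals. An illustrative example of mine, checking the gzip magic
number:

```haskell
-- Recognize a gzip stream by its two magic bytes.
isGzipHeader :: String -> Bool
isGzipHeader s = take 2 s == "\x1f\x8b"

-- The numeric value of a byte, when it is needed after all.
byteValue :: Char -> Int
byteValue = fromEnum   -- 0..255 under this convention
```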

When characters or bytes are stored in a Haskell list, the space
overhead is the same. When they are passed to and from C conversions,
varying sizes must be handled anyway, but it's an implementation
detail.

A separate type for bytes is used when explicitly interfacing to C:
CChar. But not for I/O, in my view.

--------
[1] I was told on linux-utf8 that I should use iconv, not the ISO C
    wchar_t functions, to be portable to systems where wchar_t is not
    Unicode. I did not get an answer as to which systems these are.
    Since that would require packaging our own iconv implementation
    (it is specified only in the Single Unix Spec., not even in Posix,
    and thus need not be present); and since there is no good way to
    determine either the local encoding or a variant of Unicode that
    is sure to be supported (there is a function for the former, but
    it may not be available, and the recommended code tries to guess
    it based on language and country codes; for the latter, "UTF-8"
    seems to be widely supported, with unnecessary overhead when
    UCS-4 could be used instead) - I am going to use the ISO C
    wchar_t functions for now.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK

