I am designing a framework for handling text in various encodings.
Here is the current state of my thoughts. Comments are welcome.

An important concept is a conversion from a sequence of characters
to another sequence of characters, usually in a different encoding.
Conversions usually go to or from Unicode, but for simplicity all
are treated uniformly. A conversion exists by itself; it is not
identified by its source and target encodings. Conversions can be
composed.

The conversion should be done an arbitrary block of text at a time.
It matters how input and output are split, because encodings may be
stateful (e.g. ISO-2022) and may be multibyte. It cannot process the
whole input in one go, because Handle writes arrive separately and
of course should be performed immediately. It cannot work one
character at a time, for efficiency reasons.

It follows that the conversion is generally stateful. A block of
input is supplied and the conversion produces as much output as it
can, updating its state. How the input is split must not influence
the concatenation of the outputs. A possibly partial multibyte
character at the end of a block can be handled by the conversions
themselves, to simplify clients - state is needed anyway for
stateful encodings.

Apart from real conversion, a second operation is needed: to close
a conversion, i.e. to bring the output state to its proper final
value (some encodings need that, e.g. returning to the initial
shift state) and to signal an error if the input ends in an
incomplete multibyte character.

Errors are a complex story. Errors may result from input (invalid
source bytes) or output (unavailable characters), but for simplicity
they are not distinguished that way. Various reactions to errors
are needed:

- An error may raise an exception. I think it should be the default
  for functions like getChar and putChar, treated similarly to
  I/O errors.

- An error may be ignored. This is the only possibility for readFile
  and getContents, which have no way to signal one because they are
  lazy. From a WWW browser or grep I expect that encoding errors are
  ignored, leaving e.g. U+FFFD, the standard replacement character.

- Finally, from a text editor I expect to read the file even in the
  presence of errors and to be warned that there were errors, so
  some fragments may be corrupt.

It follows that the generic result of a conversion is always the
converted text plus a flag telling whether everything went fine.
The text should be produced lazily in case nobody looks at the
error flag: when a conversion is used directly by the programmer,
he should not be forced to split the text into pieces of an
efficient size, and converting a long string should run in constant
memory.

Since the conversion may be stateful (with the state being
uninteresting to the outside), may signal errors, is usually done
near input and output (attached to Handles), and implementations may
use foreign functions - I am putting the whole thing in the IO monad.

Conversion should be an abstract type. This allows private hooks for
efficient conversion of raw blocks of memory near Handle I/O, without
using intermediate Strings of the official interface.

The above suggests a natural representation:

data Conversion = Conv {
    convert :: String -> IO (String, Bool), -- feed one block; get output and success flag
    flush   :: IO (String, Bool)}           -- close; get final output and success flag

convert and flush are public functions. A custom conversion can be
built from a pair of such functions.
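
For example, a trivial custom conversion could look like this (a toy
sketch; toUpper merely stands in for a real conversion table):

import Data.Char (toUpper)

-- Toy custom conversion: stateless and never failing.
upcaseConv :: Conversion
upcaseConv = Conv {
    convert = \s -> return (map toUpper s, True),
    flush   = return ("", True)}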

The Conversion type represents a conversion which is taking place,
with its current state. Another concept is a conversion that can be
started on a text, beginning from its initial state.

Since the only thing one can do with a conversion in the second
sense is to start a conversion in the first sense, a good type for
the second sense is IO Conversion. This is the official interface
for obtaining most conversions. The action produces a new Conversion
object, ready to begin accepting input. It could be a separate
abstract type with a separate start function, but it's simpler to
have just IO Conversion.
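
For illustration, a client would drive a conversion obtained this way
roughly as follows (a sketch; useConv is a hypothetical name):

-- Each run of the IO Conversion action yields a fresh Conversion
-- in its initial state.
useConv :: IO Conversion -> String -> String -> IO String
useConv mkConv block1 block2 = do
    conv <- mkConv                  -- start the conversion
    (out1, _) <- convert conv block1
    (out2, _) <- convert conv block2
    (out3, _) <- flush conv         -- close it
    return (out1 ++ out2 ++ out3)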

Other public functions:

convertThrow:: Conversion -> String -> IO String
flushThrow:: Conversion -> IO String
-- Unfortunately there is no good IOError code. Something as simple
-- as BadInput would be great! It could also work for readIO.
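
A sketch of how these could be defined on top of convert and flush,
with userError standing in for the missing error code:

convertThrow :: Conversion -> String -> IO String
convertThrow conv s = do
    (out, ok) <- convert conv s
    if ok then return out
          else ioError (userError "character conversion error")

flushThrow :: Conversion -> IO String
flushThrow conv = do
    (out, ok) <- flush conv
    if ok then return out
          else ioError (userError "character conversion error")

Note that inspecting the flag forces the converted text, so these
variants give up the laziness discussed above.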

convertWhole:: IO Conversion -> String -> IO (String, Bool)
convertWholeThrow:: IO Conversion -> String -> IO String
-- These obtain the conversion, call convert and flush.
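
A sketch in terms of the primitives above:

convertWhole :: IO Conversion -> String -> IO (String, Bool)
convertWhole mkConv s = do
    conv <- mkConv
    (out, ok)   <- convert conv s
    (out', ok') <- flush conv
    return (out ++ out', ok && ok')

convertWholeThrow :: IO Conversion -> String -> IO String
convertWholeThrow mkConv s = do
    conv <- mkConv
    out  <- convertThrow conv s
    out' <- flushThrow conv
    return (out ++ out')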

idConv:: IO Conversion
composeConv:: IO Conversion -> IO Conversion -> IO Conversion
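
Possible definitions (sketches): idConv passes text through
unchanged; composeConv pipes the output of the first conversion into
the second, flushing both in order when closed.

idConv :: IO Conversion
idConv = return (Conv {
    convert = \s -> return (s, True),
    flush   = return ("", True)})

composeConv :: IO Conversion -> IO Conversion -> IO Conversion
composeConv mkC1 mkC2 = do
    c1 <- mkC1
    c2 <- mkC2
    return (Conv {
        convert = \s -> do
            (mid, ok1) <- convert c1 s
            (out, ok2) <- convert c2 mid
            return (out, ok1 && ok2),
        flush = do
            (mid, ok1)  <- flush c1
            (out, ok2)  <- convert c2 mid
            (out', ok3) <- flush c2
            return (out ++ out', ok1 && ok2 && ok3)})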

from1CharStatelessConv:: (String -> (String, Bool)) -> IO Conversion
-- E.g. ISO-8859-x -> Unicode, Unicode -> ISO-8859-x, Unicode -> UTF-8.
-- Blocks are converted completely independently.
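
A sketch: since no state is needed, blocks really are independent
and flush emits nothing.

from1CharStatelessConv :: (String -> (String, Bool)) -> IO Conversion
from1CharStatelessConv f = return (Conv {
    convert = \s -> return (f s),
    flush   = return ("", True)})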

statelessConv:: (String -> (String, String, Bool)) -> IO Conversion
-- E.g. UTF-8 -> Unicode. This wrapper automatically handles a
-- partial multibyte sequence at the end of a block, which is
-- returned in the second element of the tuple.
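
A sketch of the wrapper: keep the unconsumed partial sequence in an
IORef and prepend it to the next block; flush reports an error if
anything is left over.

import Data.IORef

statelessConv :: (String -> (String, String, Bool)) -> IO Conversion
statelessConv f = do
    leftover <- newIORef ""
    return (Conv {
        convert = \s -> do
            pending <- readIORef leftover
            let (out, rest, ok) = f (pending ++ s)
            writeIORef leftover rest
            return (out, ok),
        flush = do
            pending <- readIORef leftover
            return ("", null pending)})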

statefulConv:: (String -> IO (String, Bool))
             -> IO (String, Bool)
             -> IO Conversion
-- The most general official constructor.

stdInputConv, stdOutputConv :: IO Conversion
-- Conversion between the default locale-dependent system encoding
-- and Unicode. The default for Handles, Directory operations, and
-- generally for passing C strings to foreign functions.

fromUtf8, toUtf8, fromIso8859_1, toIso8859_1 :: IO Conversion
-- etc. The most important encodings.

fromIconv, toIconv :: String -> IO Conversion
iconv:: String -> String -> IO Conversion
-- iconv is specified by the SUSv2 standard and available on various
-- Unices and in GNU libc. This and other system-specific and
-- library wrappers can be provided in appropriate modules.

-- BTW, how are these names related:
--   XPG2, Unix98, X/Open, Single Unix Specification?

There is always the question of what a conversion should do with
characters unavailable in the target encoding. Sometimes it's simply
an error, too dangerous to try to fix. But sometimes characters
should be replaced by approximate substitutes. I know of no
"official" set of substitution rules (I have made my own).

Usual conversions will not do any approximations themselves. A
generic method of improving conversions from Unicode can be made.
All it needs is an encoding-independent (but sometimes
language-dependent) mapping of possible substitutes, plus knowledge
of which characters are available in the target encoding.
Unfortunately the latter information is hard to obtain from the
above interface. Worse: common foreign interfaces like the default
system encoding and iconv don't provide that information, so
extending the above interface would not be enough. I think that a
smart filter can incrementally learn which characters are
unavailable and stop supplying them, replacing them with
approximations. It's tricky because when there was an error, it is
not known which character caused it, so the filter should pass
through only pieces extending up to the first character whose
availability is unknown. It will not work if the target encoding is
stateful, unless flush is called at these places, which may lead to
non-optimal encoding because of unnecessarily returning to the
initial shift state. Unfortunately iconv does not allow
backtracking, so again extending the interface to cover backtracking
is not enough.
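
A much simplified sketch of such a filter (hypothetical names; it
assumes a stateless target encoding and that errors come only from
unavailable characters, and it probes unknown characters one at a
time so that a failure can be attributed to a specific character):

import Data.IORef
import qualified Data.Set as Set

approxFilter :: (Char -> String)   -- approximate substitutes
             -> Conversion         -- underlying conversion from Unicode
             -> IO Conversion
approxFilter approx under = do
    bad <- newIORef Set.empty      -- characters learned to be unavailable
    let convChar c = do
            known <- readIORef bad
            if Set.member c known
              then convert under (approx c)
              else do
                (out, ok) <- convert under [c]
                if ok
                  then return (out, True)
                  else do
                    modifyIORef bad (Set.insert c)
                    convert under (approx c)
    return (Conv {
        convert = \s -> do
            results <- mapM convChar s
            return (concatMap fst results, all snd results),
        flush = flush under})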

I have been experimenting with the basics of the proposal. I found
that a function of the type
    (String -> (String, String, Bool))
defined in a natural lazy way (let instead of case for the recursive
call) works in constant space when nobody looks at the error flag,
but only when compiled with ghc -O. Without optimization the garbage
it produces is not immediately reclaimed. I don't know what
interface could make this better. Lazy stream conversion plus error
indication is surprisingly hard. Using case for the recursive call
of course causes a stack overflow on large input. I don't want to
complicate the interface by providing separate functions that ignore
the error flag and ones that don't. This is "good enough" but
requires -O for conversion of a large string in one go.
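
For concreteness, here is a toy converter in that style (a made-up
escape encoding, not a real one): the recursive result is bound with
a lazy let, so the output is produced incrementally even when the
error flag is never demanded.

-- '+' introduces a two-byte sequence whose second byte is shifted;
-- bytes below 0x80 map to themselves; anything else is an error,
-- replaced with U+FFFD. A lone '+' at the end is returned as the
-- unconsumed rest for the next block.
decode :: String -> (String, String, Bool)
decode ""         = ("", "", True)
decode "+"        = ("", "+", True)
decode ('+':c:cs) = let (out, rest, ok) = decode cs
                    in (succ c : out, rest, ok)
decode (c:cs)
    | c < '\x80'  = let (out, rest, ok) = decode cs
                    in (c : out, rest, ok)
    | otherwise   = let (out, rest, _) = decode cs
                    in ('\xFFFD' : out, rest, False)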

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                      PLACEHOLDER SIGNATURE
QRCZAK

