I am designing a framework for handling text in various encodings. Here is the current state of my thoughts; comments are welcome.

An important concept is a conversion from a sequence of characters to another sequence of characters, usually in a different encoding. Conversions are usually to Unicode or from Unicode, but for simplicity all are treated equivalently. A conversion exists by itself, not identified by its source and target encodings. Conversions can be composed.

The conversion should be done an arbitrary block of text at a time. It matters how input and output are split, because encodings may be stateful (e.g. ISO-2022) and may be multi-byte. It cannot be the whole input in one go, because Handle writes are separate and of course should be performed immediately. It cannot be one character at a time, for efficiency reasons.

It follows that the conversion is generally stateful. A block of input is supplied and the conversion produces as much output as it can, updating the state. Splitting of the input must not influence the concatenation of the outputs. Handling a possibly partial multibyte character at the end can be done by the conversions themselves, to simplify clients - state is needed anyway for stateful encodings.

Apart from the real conversion, a second operation is needed: closing a conversion, i.e. bringing the output state to the proper value (some encodings need that) and signaling an error for an incomplete input multibyte character.

Errors are a complex story. Errors may result from input (invalid source bytes) or output (unavailable characters), but for simplicity they are not distinguished that way. Various reactions to errors are needed:

- An error may raise an exception. I think this should be the default for functions like getChar and putChar, treated similarly to I/O errors.

- An error may be ignored. This is the only possibility for readFile and getContents, which have no way to signal it because they are lazy.
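As a toy illustration of the split-invariance requirement, here is a hypothetical stateful "converter" (it merely tags each character with a running position counter, using an IORef for the state; the names are mine, not part of the proposal). Feeding the input in one block or in several must concatenate to the same output:

```haskell
import Data.IORef
import Control.Monad (forM)

-- A toy stateful conversion step: the state is a counter, the output
-- tags each input character with its absolute position in the stream.
-- Purely for illustrating the invariant, not a real encoding.
newConv :: IO (String -> IO String)
newConv = do
    ref <- newIORef (0 :: Int)
    return $ \input -> fmap concat $ forM input $ \c -> do
        n <- readIORef ref
        writeIORef ref (n + 1)
        return (c : show n)

main :: IO ()
main = do
    -- Whole input at once.
    conv1 <- newConv
    whole <- conv1 "abc"
    -- The same input split into two blocks; because the counter
    -- persists across calls, the outputs concatenate identically.
    conv2 <- newConv
    p1 <- conv2 "a"
    p2 <- conv2 "bc"
    print (whole == p1 ++ p2)
```

Running main prints True: the per-block outputs "a0" and "b1c2" concatenate to the single-block output "a0b1c2".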
  From a WWW browser or grep I expect that encoding errors are ignored, leaving e.g. U+FFFD, the standard replacement character.

- Finally, from a text editor I expect it to read the file even in the presence of errors and to warn me that there were errors, so fragments may be corrupt.

It follows that a generic result from a conversion is always the converted text, plus a flag telling whether everything went fine. The text should be produced lazily in case nobody looks at the error flag: when a conversion is used directly by the programmer, he should not be forced to split the text into pieces of an efficient size, and converting a long string should run in constant memory.

Since the conversion may be stateful (with a state that is uninteresting to the outside), may signal errors, is usually done near input and output (attached to Handles), and an implementation may use foreign functions - I am putting the whole thing in the IO monad.

Conversion should be an abstract type. This allows private hooks for efficient conversion of raw blocks of memory near Handle I/O, without going through the intermediate Strings of the official interface.

The above suggests a natural representation:

    data Conversion = Conv
        { convert :: String -> IO (String, Bool)
        , flush   :: IO (String, Bool)
        }

convert and flush are public functions. A custom conversion can be built from a pair of such functions.

The Conversion type represents a conversion which is taking place, with its current state. Another concept is a conversion that can be started on a text, beginning from its initial state. Since the only thing one can do with a conversion in the second sense is to start a conversion in the first sense, a good type for the second sense is IO Conversion. This is the official interface for obtaining most conversions: the action produces a new Conversion object, ready to begin accepting input. It could be a separate abstract type with a separate start function, but it's simpler to have just IO Conversion.
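Under this representation the identity conversion is almost trivial. A minimal sketch (the Conversion type is repeated so the fragment is self-contained; the main function is only a usage demonstration):

```haskell
-- The representation from the proposal.
data Conversion = Conv
    { convert :: String -> IO (String, Bool)
    , flush   :: IO (String, Bool)
    }

-- The identity conversion: stateless, never fails, nothing to flush.
-- The IO wrapper produces a fresh Conversion ready to accept input.
idConv :: IO Conversion
idConv = return Conv
    { convert = \s -> return (s, True)
    , flush   = return ("", True)
    }

main :: IO ()
main = do
    conv <- idConv
    (out,   ok)  <- convert conv "hello"
    (tail', ok') <- flush conv
    putStrLn (out ++ tail')
    print (ok && ok')
```

Note that even this trivial conversion goes through IO: the caller cannot tell whether a given Conversion carries hidden state or calls foreign code, which is exactly the abstraction intended.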
Other public functions:

    convertThrow :: Conversion -> String -> IO String
    flushThrow   :: Conversion -> IO String
    -- Unfortunately there is no good IOError code. Something as simple
    -- as BadInput would be great! It could also work for readIO.

    convertWhole      :: IO Conversion -> String -> IO (String, Bool)
    convertWholeThrow :: IO Conversion -> String -> IO String
    -- These obtain the conversion, call convert and flush.

    idConv      :: IO Conversion
    composeConv :: IO Conversion -> IO Conversion -> IO Conversion

    from1CharStatelessConv :: (String -> (String, Bool)) -> IO Conversion
    -- E.g. ISO-8859-x -> Unicode, Unicode -> ISO-8859-x, Unicode -> UTF-8.
    -- Blocks are converted completely independently.

    statelessConv :: (String -> (String, String, Bool)) -> IO Conversion
    -- E.g. UTF-8 -> Unicode. This wrapper automatically handles a
    -- partial multibyte sequence at the end, which is returned as the
    -- second element of the tuple.

    statefulConv :: (String -> IO (String, Bool)) -> IO (String, Bool)
                 -> IO Conversion
    -- The most general official constructor.

    stdInputConv, stdOutputConv :: IO Conversion
    -- Conversion between the default locale-dependent system encoding
    -- and Unicode. The default on Handles, Directory operations, and
    -- generally when passing C strings to foreign functions.

    fromUtf8, toUtf8, fromIso8859_1, toIso8859_1 :: IO Conversion -- etc.
    -- The most important encodings.

    fromIconv, toIconv :: String -> IO Conversion
    iconv :: String -> String -> IO Conversion
    -- iconv is specified by the SUSv2 standard and available on various
    -- Unices and from GNU libc. This and other system-specific library
    -- wrappers can be provided in appropriate modules.
    -- BTW, how are these names related:
    -- XPG2, Unix98, X/Open, Single Unix Specification?

There is always the question: what should a conversion do with characters unavailable in the target encoding? Sometimes it's simply an error, too dangerous to try to fix. But sometimes characters should be replaced by approximate substitutes.
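The wrapper convertWhole is a one-liner over convert and flush: start the conversion, feed the entire string, close it, and AND the two error flags. A sketch under the representation above (the Conversion type and a trivial identity conversion are repeated so the fragment compiles on its own):

```haskell
data Conversion = Conv
    { convert :: String -> IO (String, Bool)
    , flush   :: IO (String, Bool)
    }

-- Obtain a fresh conversion, convert the whole string, then close it.
-- The result is fine only if both the conversion and the close
-- succeeded, hence the conjunction of the flags.
convertWhole :: IO Conversion -> String -> IO (String, Bool)
convertWhole mkConv s = do
    conv <- mkConv
    (out,  ok)  <- convert conv s
    (out', ok') <- flush conv
    return (out ++ out', ok && ok')

-- A trivial conversion to exercise it.
idConv :: IO Conversion
idConv = return (Conv (\s -> return (s, True)) (return ("", True)))

main :: IO ()
main = do
    (res, ok) <- convertWhole idConv "qrczak"
    putStrLn res
    print ok
```

convertWholeThrow would be the same, followed by raising an exception when the flag is False.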
I know of no "official" set of substitution rules (I have made my own). Usual conversions will not do any approximations themselves.

A generic method of improving conversions from Unicode can be made. All it needs is an encoding-independent (but sometimes language-dependent) mapping of possible substitutes, and knowledge of which characters are available in the target encoding. Unfortunately the latter information is hard to obtain from the above interface. Worse: common foreign interfaces like the default system encoding and iconv don't provide that information, so it would not be enough to extend the above interface.

I think that a smart filter can incrementally learn which characters are unavailable and stop supplying them, replacing them with approximations. It's tricky, because when an error occurs it's not known which character caused it, so the filter should pass on only pieces up to the first character that it does not yet know to be available. It will not work if the target encoding is stateful, unless flush is called in these places, which may lead to non-optimal encoding because of unnecessarily returning to the initial shift state. Unfortunately iconv does not allow backtracking, so again extending the interface to cover backtracking is not enough.

I was experimenting with the basics of the proposal. I found that a function of the type (String -> (String, String, Bool)), defined in a natural lazy way (let instead of case for the recursive call), works in constant space when nobody looks at the error flag - but only when compiled with ghc -O. Without optimization the garbage it produces is not immediately reclaimed. I don't know what interface could make it better. Lazy stream conversion plus error indication is surprisingly hard. Using case for the recursive call of course causes a stack overflow on large input. I don't want to complicate the interface by providing separate functions that ignore the error flag alongside ones that don't.
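The statelessConv wrapper mentioned earlier can be sketched with an IORef that carries the unconsumed tail between blocks: the pure block converter returns (output, unconsumed tail, ok flag), the wrapper prepends the tail to the next block, and at flush a nonempty tail means a truncated multibyte sequence. The toy "decoder" below (two letters become one character) and all the names are mine; a real instance would be a UTF-8 decoder:

```haskell
import Data.IORef

data Conversion = Conv
    { convert :: String -> IO (String, Bool)
    , flush   :: IO (String, Bool)
    }

-- Wrap a pure block converter, keeping its unconsumed tail in an
-- IORef so that clients may split the input arbitrarily.
statelessConv :: (String -> (String, String, Bool)) -> IO Conversion
statelessConv f = do
    leftover <- newIORef ""
    return Conv
        { convert = \block -> do
            pending <- readIORef leftover
            let (out, rest, ok) = f (pending ++ block)
            writeIORef leftover rest
            return (out, ok)
        , flush = do
            pending <- readIORef leftover
            writeIORef leftover ""
            -- A pending fragment at close is an incomplete sequence.
            return ("", null pending)
        }

-- Toy block converter: each pair of characters yields one output
-- character; a lone trailing character is an incomplete "sequence".
pairs :: String -> (String, String, Bool)
pairs (a:b:rest) = let (out, tl, ok) = pairs rest
                   in (max a b : out, tl, ok)
pairs [c]        = ("", [c], True)
pairs []         = ("", "", True)

main :: IO ()
main = do
    conv <- statelessConv pairs
    (o1, _)  <- convert conv "abc"   -- consumes "ab", keeps "c" pending
    (o2, _)  <- convert conv "d"     -- "c" ++ "d" completes a pair
    (_,  ok) <- flush conv
    putStrLn (o1 ++ o2)
    print ok
```

The same shape works for a real UTF-8 decoder, and it is exactly where the laziness problem described above bites: the recursive call inside pairs uses let, not case, so that the output can be consumed before the error flag is known.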
This is "good enough" but requires -O for conversion of a large string in one go.

-- 
 __("<  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^                 SYGNATURA ZASTĘPCZA (replacement signature)
QRCZAK