Matthew Mondor <mm_li...@pulsar-zone.net> writes:

> On Mon, 21 Oct 2013 12:24:50 +0200
> "Pascal J. Bourguignon" <p...@informatimago.com> wrote:
>
>> When reading utf-8 or other unicode streams, invalid byte sequences can
>> signal errors, be substituted by a given character, or be encoded into
>> application reserved code points to be able to transparently transmit the
>> invalid byte sequence. Cf. clisp :INPUT-ERROR-ACTION parameter of
>> ext:make-encoding (clisp encodings are external-format values).
>> http://clisp.org/impnotes/encoding.html#make-encoding
>
> I agree with the above, and it's currently possible in ECL to handle
> UTF-8 decoding errors (ext:stream-decoding-error) with access to the
> octets of the invalid sequence (ext:character-decoding-error-octets),
> with an available restart (invoke-restart 'use-value ...). Thus an
> application is free to also recode the invalid octets to LATIN or to
> implement "UTF8-B" at its discretion, if it implements its own input
> and output.
>
> The advantage of native modes such as UTF8-B or UTF8-LATIN-1 etc would
> be performance and simplicity in cases where this is wanted, but the
> default UTF-8 streams would continue to explicitly signal decoding
> errors, definitely.
>
> If you also mean that CLisp can also optionally do such conversions
> transparently on request (or that its interface allows user code to do
> this more efficiently), that's a good thing to know and I should look
> at its implementation for ideas on the way it presents that interface.
> I've added to my notes the link above, thanks a lot for your answer.

That's not the case, but it could be an additional option to the
:input-error-action parameter of make-encoding. In clisp the current
options are:

    The :INPUT-ERROR-ACTION argument specifies what happens when an
    invalid byte sequence is encountered while converting bytes to
    characters. Its value can be :ERROR, :IGNORE or a character to be
    used instead. The UNICODE character #\uFFFD is typically used to
    indicate an error in the input sequence.

There are several unicode private use areas:
http://en.wikipedia.org/wiki/Private_Use_%28Unicode%29#Private_Use_Areas

So we could add `(:invalid-map ,first-character-or-code) as a possible
value for :input-error-action, with the constraint that:

    (<= 0
        (typecase first-character-or-code
          (character     (char-code first-character-or-code))
          (unsigned-byte first-character-or-code)
          (t             -1))
        (- char-code-limit 256))

(There could be some additional constraints on the range, if there
were missing characters, as in e.g. ccl:

    clall -r '(length (loop for i below char-code-limit
                            unless (ignore-errors (code-char i))
                            collect i))'

    Armed Bear Common Lisp         --> 0
    Clozure Common Lisp            --> 2050
    CLISP                          --> 0
    CMU Common Lisp                --> 0
    ECL                            --> 0
    SBCL                           --> 0
).

To use private areas, the user could restrict first-character-or-code
to be of type:

    `(or (integer #xe000   ,(- #xf8ff   -1 256))
         (integer #xf0000  ,(- #xffffd  -1 256))
         (integer #x100000 ,(- #x10fffd -1 256)))

but we can allow any character range. 0 would be useful to consider
invalid byte sequences as encoded in iso-8859-1.

An additional :output-invalid-map parameter would specify the same for
the reverse on output.

Finally, another option to :input-error-action could be to take a
(function (input-stream octet-vector) (values &optional (or character string))),
octet-vector being a vector of invalid bytes read so far from the
stream. The function could further read bytes, and either build and
return a character or string, return no value (ignoring the read
bytes), or signal an error.

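To make the (:invalid-map first-character-or-code) idea above a bit more
concrete, here is a minimal sketch in portable Common Lisp of the mapping
it implies: each invalid octet N becomes the character with code base+N,
and the reverse mapping recovers the octet on output. The function names
(invalid-map-base, map-invalid-octets, unmap-invalid-character) are
hypothetical; nothing like them exists in clisp or ECL today, and this
only illustrates the arithmetic, not how an implementation would hook it
into its stream decoder.

```lisp
;; Hypothetical helpers illustrating the proposed (:invalid-map BASE) option.
;; None of these names are part of CLISP, ECL or any other implementation.

(defun invalid-map-base (first-character-or-code)
  "Return the first code of the 256-character window, or signal an error
when the whole window would not fit below CHAR-CODE-LIMIT."
  (let ((base (typecase first-character-or-code
                (character     (char-code first-character-or-code))
                (unsigned-byte first-character-or-code)
                (t             -1))))
    (unless (<= 0 base (- char-code-limit 256))
      (error "Invalid :invalid-map base: ~S" first-character-or-code))
    base))

(defun map-invalid-octets (octets first-character-or-code)
  "Map each invalid octet N to the character whose code is BASE + N.
Assumes every code in the window designates a character, which is not
guaranteed everywhere (see the Clozure CL count above)."
  (let ((base (invalid-map-base first-character-or-code)))
    (map 'string (lambda (octet) (code-char (+ base octet))) octets)))

(defun unmap-invalid-character (character first-character-or-code)
  "Reverse mapping for output: return the original octet when CHARACTER
lies inside the window, and NIL otherwise."
  (let ((offset (- (char-code character)
                   (invalid-map-base first-character-or-code))))
    (when (<= 0 offset 255)
      offset)))
```

With a base of 0, (map-invalid-octets #(65 66) 0) simply reads the octets
as iso-8859-1. With a base of #xE000, on an implementation with Unicode
characters, the same call yields characters in the first private use
area, and unmap-invalid-character gives the octets back on output.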
:output-invalid-map could take a
(function (output-stream unencodable-character)) that could do whatever
it wants to encode, replace or ignore the unencodable-character on the
output-stream.

So I guess :utf-8b could be defined as designating:

    (make-encoding :charset :utf-8
                   :input-error-action '(:invalid-map 0))

or equivalently:

    (make-encoding :charset :utf-8
                   :input-error-action (lambda (stream octet-vector)
                                         (declare (ignore stream))
                                         (map 'string (function code-char) octet-vector)))

-- 
__Pascal Bourguignon__
http://www.informatimago.com/

_______________________________________________
Ecls-list mailing list
Ecls-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ecls-list