Re: [Ecls-list] mapping invalid octet sequence to unicode private areas or others.

Pascal J. Bourguignon Mon, 21 Oct 2013 06:48:17 -0700

Matthew Mondor <mm_li...@pulsar-zone.net>
writes:

> On Mon, 21 Oct 2013 12:24:50 +0200
> "Pascal J. Bourguignon" <p...@informatimago.com> wrote:
>
>> When reading utf-8 or other unicode streams, invalid byte sequences can
>> signal errors, be substituted by a given character, or be encoded into
>> application reserved code points to be able to transparently transmit the
>> invalid byte sequence.  Cf. clisp :INPUT-ERROR-ACTION parameter of
>> ext:make-encoding (clisp encodings are external-format values).
>> http://clisp.org/impnotes/encoding.html#make-encoding
>
> I agree with the above, and it's currently possible in ECL to handle
> UTF-8 decoding errors (ext:stream-decoding-error) with access to the
> octets of the invalid sequence (ext:character-decoding-error-octets),
> with an available restart (invoke-restart 'use-value ...).  Thus an
> application is free to also recode the invalid octets to LATIN or to
> implement "UTF8-B" at its discretion, if it implements its own input
> and output.
>
> The advantage of native modes such as UTF8-B or UTF8-LATIN-1 etc would
> be performance and simplicity in cases where this is wanted, but the
> default UTF-8 streams would continue to explicitly signal decoding
> errors, definitely.
>
> If you also mean that CLisp can also optionally do such conversions
> transparently on request (or that its interface allows user code to do
> this more efficiently), that's a good thing to know and I should look
> at its implementation for ideas on the way it presents that interface.
> I've added to my notes the link above, thanks a lot for your answer.


That's not the case, but it could be an additional option to the
:input-error-action parameter of make-encoding.  In clisp the current
options are:

    The :INPUT-ERROR-ACTION argument specifies what happens when an
    invalid byte sequence is encountered while converting bytes to
    characters. Its value can be :ERROR, :IGNORE or a character to be
    used instead. The UNICODE character #\uFFFD is typically used to
    indicate an error in the input sequence.


There are several unicode private use areas:
http://en.wikipedia.org/wiki/Private_Use_%28Unicode%29#Private_Use_Areas

So we could add `(:invalid-map ,first-character-or-code) as a possible
value for :input-error-action, with the constraint that

 (<= 0 
     (typecase first-character-or-code
        (character      (char-code first-character-or-code))
        (unsigned-byte  first-character-or-code)
        (t              -1))
     (- char-code-limit 256))


(There could be some additional constraints on the range, if there were
 missing characters, as in e.g. ccl:

    clall -r '(length (loop for i below char-code-limit
                          unless (ignore-errors (code-char i)) collect i))'

    Armed Bear Common Lisp         --> 0
    Clozure Common Lisp            --> 2050
    CLISP                          --> 0
    CMU Common Lisp                --> 0
    ECL                            --> 0
    SBCL                           --> 0
).

To use private areas, the user could restrict first-character-or-code to
be of type:

         `(or (integer #xe000   ,(- #xf8ff -1 256))
              (integer #xf0000  ,(- #xffffd -1 256))
              (integer #x100000 ,(- #x10fffd -1 256)))

but we can allow any character range.  0 would be useful to consider
invalid byte sequences as encoded in iso-8859-1.
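
As a minimal sketch, the three ranges above can be wrapped in a type
definition and checked with typep.  The name private-use-map-base is
made up for this illustration; it is not part of clisp or ECL.

```lisp
;; Hypothetical type name: the first code of a 256-code window that
;; fits entirely inside one of the Unicode private use areas.
(deftype private-use-map-base ()
  `(or (integer #xe000   ,(- #xf8ff -1 256))     ; BMP private use area
       (integer #xf0000  ,(- #xffffd -1 256))    ; plane 15
       (integer #x100000 ,(- #x10fffd -1 256)))) ; plane 16

(typep #xe000 'private-use-map-base)  ; => T
(typep #xf801 'private-use-map-base)  ; => NIL, #xf801 + 255 exceeds #xf8ff
(typep 0 'private-use-map-base)       ; => NIL, 0 is the iso-8859-1 case
```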


An additional :output-invalid-map parameter would specify the same for
the reverse on output.
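
To make the two directions concrete, here is a rough sketch of the
mapping itself, assuming an implementation whose characters cover the
chosen window (char-code-limit greater than base + 255).  The function
names are hypothetical and only illustrate the idea; they are not an
existing clisp or ECL interface.

```lisp
;; Input side: an invalid octet becomes the character at BASE + OCTET.
(defun invalid-octet-to-char (octet base)
  (check-type octet (integer 0 255))
  (code-char (+ base octet)))

;; Output side: a character inside the 256-code window starting at
;; BASE is turned back into the original octet; any other character
;; yields NIL and would be encoded normally.
(defun char-to-invalid-octet (char base)
  (let ((offset (- (char-code char) base)))
    (when (<= 0 offset 255)
      offset)))

;; Round trip with a base in the BMP private use area:
(char-to-invalid-octet (invalid-octet-to-char #x80 #xe000) #xe000) ; => 128
;; A character outside the window is left alone:
(char-to-invalid-octet #\a #xe000)                                 ; => NIL
```

Note that with a base of 0 the reverse direction cannot tell a mapped
octet from an ordinary iso-8859-1 character, so the round trip is only
unambiguous when the window does not overlap characters that occur in
valid input.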


Finally, another option to :input-error-action could be to take a
function of type

    (function (input-stream octet-vector)
              (values &optional (or character string)))

octet-vector being a vector of the invalid bytes read so far from the
stream.  The function could read further bytes, and either build and
return a character or string, return no value (ignoring the bytes read),
or signal an error.

:output-invalid-map could take a

    (function (output-stream unencodable-character))

that could do whatever it wants to encode, replace or ignore the
unencodable-character on the output-stream.
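
As an illustration of the function-valued input option, here is a
small portable sketch of a UTF-8 decoder that calls a user function on
invalid input.  It is simplified in several ways that are assumptions
of this sketch, not part of the proposal: it works on an octet vector
instead of a stream, the handler takes only the octet vector (no
stream argument), and invalid octets are handed over one at a time.
The names decode-sequence and decode-utf-8 are hypothetical, and the
code assumes an implementation with Unicode characters.

```lisp
(defun decode-sequence (octets start)
  "Return the code point and length of the well-formed UTF-8 sequence
at START, or NIL if there is none."
  (let* ((lead (aref octets start))
         (len (cond ((< lead #x80) 1)
                    ((<= #xc0 lead #xdf) 2)
                    ((<= #xe0 lead #xef) 3)
                    ((<= #xf0 lead #xf7) 4)))   ; NIL for any other lead
         (end (and len (+ start len))))
    (when (and end (<= end (length octets)))
      ;; Keep only the payload bits of the lead octet.
      (let ((code (if (= len 1)
                      lead
                      (logand lead (ash #xff (- (1+ len)))))))
        (loop for i from (1+ start) below end
              do (let ((octet (aref octets i)))
                   (if (= (logand octet #xc0) #x80)
                       (setf code (logior (ash code 6) (logand octet #x3f)))
                       (return-from decode-sequence nil))))
        ;; Reject overlong forms, surrogates and out-of-range codes.
        (when (and (>= code (aref #(0 #x80 #x800 #x10000) (1- len)))
                   (<= code #x10ffff)
                   (not (<= #xd800 code #xdfff)))
          (values code len))))))

(defun decode-utf-8 (octets on-invalid)
  "Decode OCTETS to a string.  ON-INVALID receives a one-element vector
holding each invalid octet and returns a character, a string, or
nothing (the octet is then dropped)."
  (with-output-to-string (out)
    (let ((i 0))
      (loop while (< i (length octets))
            do (multiple-value-bind (code len) (decode-sequence octets i)
                 (if code
                     (progn (write-char (code-char code) out)
                            (incf i len))
                     (let ((replacement
                             (funcall on-invalid (subseq octets i (1+ i)))))
                       (etypecase replacement
                         (null nil)
                         (character (write-char replacement out))
                         (string (write-string replacement out)))
                       (incf i))))))))

;; Dropping invalid octets, in the spirit of :IGNORE:
(decode-utf-8 (vector #x61 #xff #x62)
              (lambda (octets) (declare (ignore octets)) (values)))
;; => "ab"
```

Passing (lambda (octets) (map 'string (function code-char) octets))
as the handler instead gives the iso-8859-1 fallback behaviour.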



So I guess :utf-8b could be defined as designating:

(make-encoding :charset :utf-8 :input-error-action '(:invalid-map 0))

or equivalently:

(make-encoding :charset :utf-8
   :input-error-action (lambda (stream octet-vector)
                          (declare (ignore stream))
                          (map 'string (function code-char) octet-vector)))

-- 
__Pascal Bourguignon__
http://www.informatimago.com/


