On Sun, 23 Jan 2011 23:52:41 -0500
Matthew Mondor <mm_li...@pulsar-zone.net> wrote:

> I guess that another possibility (which could offer a less complex and
> more efficient interface than the SBCL way) would be for ECL to
> automatically and transparently return invalid UTF-8 sequence octets as
> characters remapped into an unassigned range, and at UTF-8 output
> transparently convert characters of that range back to the literal octets,
> while still letting the application potentially deal with that range of
> "characters" as it wants, as long as it's documented...

The attached example describes better what I was saying above.
-- 
Matt
;;; This is example-code only, but explains what I meant in my earlier post.
;;;
;;; Initially I thought about using characters in the range 1FFF00 to 1FFFFF
;;; because they are invalid Unicode.  However, both ECL and SBCL currently
;;; explicitly limit characters to the standard Unicode range, with
;;; CHAR-CODE-LIMIT set at #x110000.
;;; On SBCL, a condition of type TYPE-ERROR is signaled if CODE-CHAR is given
;;; a number larger than 1114111 (#x10FFFF), and ECL returns NIL (which is
;;; closer to what the HyperSpec suggests).  I then thought about using the
;;; range from 10FE00 to 10FEFF, which lies in an area the Unicode standard
;;; reserves for private use and which ECL accepts.  A potential problem is
;;; that even if those code points aren't recommended for external use, some
;;; streams could still contain them.  Another possibility would be to make
;;; ECL accept characters above the Unicode limit, in an invalid region, and
;;; to reserve those for this purpose.
;;;
;;; Some implementations already appear to be using the invalid codepoints
;;; in the range used for UTF-16 surrogates (D800–DBFF, DC00–DFFF), such as
;;; DC80-DCFF.  This example also uses this range.
;;;
;;;-On UTF-8 input decoding, decoding errors should simply push the invalid
;;; octets as special characters using the MAKE-INVAL-SEQ-OCTET macro.
;;;-On UTF-8 output encoding, characters matching INVAL-SEQ-OCTET-P should
;;; be written back as literal 8-bit octets.
;;; This lets the original stream pass through conversions unmodified if
;;; user code wants that, and also avoids having to signal an error on every
;;; invalid UTF-8 sequence along with a restart to permit user code to recover
;;; (which is what SBCL currently does).
;;;-User code may itself check for invalid sequences, if it cares, by using
;;; the INVAL-SEQ-OCTET-P and GET-INVAL-SEQ-OCTET macros on input characters.
;;; It may then eliminate those, convert them to wanted characters (such as
;;; ISO-8859 characters), etc.  It may also print them so that they display
;;; on screen as ISO-8859 characters without altering the internal
;;; representation.  Conversion is as simple as passing an input character
;;; through MAP-INVALID-SEQ-ISO-8859.
;;;-The CL implementation itself could perform such conversion transparently
;;; when printing that range of characters to a UTF-8 external format stream
;;; with an option such as *PRINT-INVAL-SEQ-ISO8859* set to T.

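;;; Invalid octet N is represented as the character with code #xDC00 + N.
;;; Octets below #x80 are always valid UTF-8 on their own, so only DC80-DCFF
;;; should occur in practice.
(defconstant +inval-seq-min+ #xDC00)
(defconstant +inval-seq-max+ #xDCFF)
(defconstant +inval-seq-mask+ #xFF)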

(defmacro inval-seq-octet-p (c)
  `(>= +inval-seq-max+ (char-code ,c) +inval-seq-min+))

(defmacro get-inval-seq-octet (c)
  `(logand (char-code ,c) +inval-seq-mask+))

(defmacro make-inval-seq-octet (n)
  `(code-char (logior +inval-seq-min+ (logand ,n +inval-seq-mask+))))

(defmacro map-invalid-seq-iso-8859 (c)
  (let ((ch (gensym)))  ; Only evaluate C once
    `(let ((,ch ,c))
       (if (inval-seq-octet-p ,ch)
           (code-char (get-inval-seq-octet ,ch))
           ,ch))))
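
To make the decoding and encoding rules described in the comments above more
concrete, here is a minimal sketch of a decoder and encoder working on plain
octet vectors rather than on streams.  All function names below
(ESCAPE-OCTET, ESCAPED-OCTET, SEQUENCE-LENGTH, DECODE-AT, DECODE-UTF-8,
ENCODE-UTF-8) are hypothetical and not part of ECL or SBCL; the sketch also
assumes that CODE-CHAR returns a character for codes DC80-DCFF, which is
implementation dependent.  It is only an illustration of the idea, not how an
external format would actually be implemented.

```lisp
;;; Illustration only: all names here are hypothetical helpers.
;;; Assumes CODE-CHAR returns a character for codes #xDC80-#xDCFF.

(defun escape-octet (octet)
  "Return the placeholder character standing for an undecodable OCTET."
  (code-char (logior #xDC00 octet)))

(defun escaped-octet (char)
  "Return the octet hidden in CHAR, or NIL if CHAR is not a placeholder."
  (let ((code (char-code char)))
    (when (<= #xDC80 code #xDCFF)
      (logand code #xFF))))

(defun sequence-length (lead)
  "Length of a UTF-8 sequence starting with LEAD, or NIL for an invalid lead."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

(defun decode-at (octets i)
  "Decode one well-formed sequence at index I of OCTETS.
Return the code point and the sequence length, or NIL if it is invalid."
  (let* ((lead (aref octets i))
         (len (sequence-length lead)))
    (when (and len (<= (+ i len) (length octets)))
      (let ((code (if (= len 1)
                      lead
                      (logand lead (ash #xFF (- (1+ len)))))))
        (loop for j from (1+ i) below (+ i len)
              for octet = (aref octets j)
              do (if (= (logand octet #xC0) #x80)
                     (setf code (logior (ash code 6) (logand octet #x3F)))
                     (return-from decode-at nil)))
        ;; Reject overlong forms, surrogates and out-of-range values.
        (when (and (>= code (aref #(0 0 #x80 #x800 #x10000) len))
                   (not (<= #xD800 code #xDFFF))
                   (<= code #x10FFFF))
          (values code len))))))

(defun decode-utf-8 (octets)
  "Decode OCTETS to a string; each undecodable octet becomes a placeholder."
  (with-output-to-string (out)
    (let ((i 0))
      (loop while (< i (length octets))
            do (multiple-value-bind (code len) (decode-at octets i)
                 (cond (code (write-char (code-char code) out)
                             (incf i len))
                       (t (write-char (escape-octet (aref octets i)) out)
                          (incf i))))))))

(defun encode-utf-8 (string)
  "Encode STRING as octets, writing placeholders back as their raw octets."
  (let ((octets '()))
    (flet ((emit (octet) (push octet octets)))
      (loop for char across string
            for code = (char-code char)
            for raw = (escaped-octet char)
            do (cond (raw (emit raw))
                     ((< code #x80) (emit code))
                     ((< code #x800)
                      (emit (logior #xC0 (ash code -6)))
                      (emit (logior #x80 (logand code #x3F))))
                     ((< code #x10000)
                      (emit (logior #xE0 (ash code -12)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F))))
                     (t
                      (emit (logior #xF0 (ash code -18)))
                      (emit (logior #x80 (logand (ash code -12) #x3F)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F)))))))
    (coerce (nreverse octets) '(vector (unsigned-byte 8)))))

;;; Usage: "A", a valid UTF-8 "é", a stray ISO-8859-1 "é" (#xE9), then "B".
(let* ((input #(#x41 #xC3 #xA9 #xE9 #x42))
       (text (decode-utf-8 input)))
  (assert (= (length text) 4))
  (assert (eql (escaped-octet (char text 2)) #xE9))
  (assert (equalp (encode-utf-8 text) input)))
```

Because the decoder in this sketch rejects UTF-8 encoded surrogates, an input
that already contains an encoded DC80-DCFF code point is treated as three
invalid octets and is also written back unchanged, so the round trip stays
lossless.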
_______________________________________________
Ecls-list mailing list
Ecls-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ecls-list
