On Sun, 23 Jan 2011 23:52:41 -0500
Matthew Mondor <mm_li...@pulsar-zone.net> wrote:
> I guess that another possibility (which could offer a less complex and
> more efficient interface than the SBCL way) would be for ECL to
> automatically and transparently return the octets of invalid UTF-8
> sequences remapped into an unassigned range, and on UTF-8 output to
> transparently write characters of that range back as the literal octets,
> while still letting the application potentially deal with that range of
> "characters" as it wants, as long as it's documented...
The attached example illustrates better what I was saying above.
--
Matt
;;; This is example-code only, but explains what I meant in my earlier post.
;;;
;;; Initially I thought about using characters in the range 1FFF00 to 1FFFFF
;;; because they are invalid Unicode. However, both ECL and SBCL currently
;;; explicitly limit characters to the standard Unicode range, with
;;; CHAR-CODE-LIMIT set to #x110000.
;;; On SBCL, a condition of type TYPE-ERROR is signaled if CODE-CHAR is given
;;; a number larger than 1114111 (#x10FFFF), and ECL returns NIL (which is
;;; closer to what the HyperSpec suggests). I then thought about using the
;;; range from 10FE00 to 10FEFF, which the Unicode standard reserves for
;;; private use and which ECL accepts. A potential problem might be that
;;; even if those code points aren't recommended for external use, some
;;; streams could still contain them. Another possibility would be to make
;;; ECL accept more characters upwards in an invalid region, and to reserve
;;; these for this purpose.
;;;
;;; Some implementations already appear to be using the invalid codepoints
;;; in the range used for UTF-16 surrogates (D800-DBFF, DC00-DFFF), such as
;;; DC80-DCFF. This example also uses this range.
;;;
;;;-On UTF-8 input decoding, decoding errors should simply push the invalid
;;; octets as special characters using the MAKE-INVAL-SEQ-OCTET macro.
;;;-On UTF-8 output encoding, characters matching INVAL-SEQ-OCTET-P should
;;; be output back as literal 8-bit octets.
;;; This allows the original stream to remain unmodified by conversions if
;;; wanted by user code, and also avoids having to signal an error on every
;;; invalid UTF-8 sequence along with a restart to permit user code to recover
;;; (which is what SBCL currently does).
;;;-User code may itself check for invalid sequences, if it cares, by using
;;; the INVAL-SEQ-OCTET-P and GET-INVAL-SEQ-OCTET macros on input characters.
;;; It may then eliminate those, convert them to the wanted characters
;;; (such as ISO-8859 characters), etc. It may also optionally print them
;;; so that they display on screen as ISO-8859 characters without altering
;;; the internal representation. Conversion is as simple as passing an input
;;; character through MAP-INVALID-SEQ-ISO-8859.
;;;-The CL implementation itself could perform such conversion transparently
;;; when printing that range of characters to a UTF-8 external format stream
;;; with an option such as *PRINT-INVAL-SEQ-ISO8859* set to T.
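(defconstant +inval-seq-min+ #xDC00)
(defconstant +inval-seq-max+ #xDCFF)
(defconstant +inval-seq-mask+ #xFF)

;; True if character C lies in the range reserved for invalid octets.
(defmacro inval-seq-octet-p (c)
  `(<= +inval-seq-min+ (char-code ,c) +inval-seq-max+))

;; Recover the original octet from a remapped character C.
(defmacro get-inval-seq-octet (c)
  `(logand (char-code ,c) +inval-seq-mask+))

;; Remap octet N to its character in the reserved range.
(defmacro make-inval-seq-octet (n)
  `(code-char (logior +inval-seq-min+ (logand ,n +inval-seq-mask+))))

;; Map a remapped character to the character whose code is the original
;; octet value; return any other character unchanged.
(defmacro map-invalid-seq-iso-8859 (c)
  (let ((ch (gensym)))                  ; Only evaluate C once
    `(let ((,ch ,c))
       (if (inval-seq-octet-p ,ch)
           (code-char (get-inval-seq-octet ,ch))
           ,ch))))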
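
;;; The round trip described in the notes above could look roughly like the
;;; sketch below. It is only an illustration, not ECL or SBCL code: the names
;;; ESCAPE-OCTET, DECODE-SEQUENCE, DECODE-UTF-8-LENIENT and
;;; ENCODE-UTF-8-LENIENT are hypothetical, and ESCAPE-OCTET merely mirrors
;;; MAKE-INVAL-SEQ-OCTET so that the block loads on its own. It assumes that
;;; CODE-CHAR returns a character for code points in DC80-DCFF, which is
;;; implementation-dependent.

```lisp
;; Hypothetical helper mirroring MAKE-INVAL-SEQ-OCTET: an invalid octet
;; (always >= #x80 in UTF-8) becomes a character in U+DC80..U+DCFF.
(defun escape-octet (octet)
  (code-char (logior #xDC00 octet)))

;; Decode the LEN-octet sequence starting at START. Return the code point,
;; or NIL if the sequence is not well-formed UTF-8.
(defun decode-sequence (octets start len)
  (let ((code (case len
                (1 (aref octets start))
                (2 (logand (aref octets start) #x1F))
                (3 (logand (aref octets start) #x0F))
                (t (logand (aref octets start) #x07)))))
    (loop for k from 1 below len
          for b = (aref octets (+ start k))
          do (if (= (logand b #xC0) #x80) ; must be a continuation octet
                 (setf code (logior (ash code 6) (logand b #x3F)))
                 (return-from decode-sequence nil)))
    (and (>= code (aref #(0 0 #x80 #x800 #x10000) len)) ; reject overlong forms
         (<= code #x10FFFF)
         (not (<= #xD800 code #xDFFF))    ; reject encoded surrogates
         code)))

;; Decode OCTETS as UTF-8 without signaling: every octet that is not part
;; of a well-formed sequence is kept as an escaped character.
(defun decode-utf-8-lenient (octets)
  (let ((out (make-string-output-stream))
        (i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let* ((b0 (aref octets i))
                    (len (cond ((< b0 #x80) 1)
                               ((<= #xC2 b0 #xDF) 2)
                               ((<= #xE0 b0 #xEF) 3)
                               ((<= #xF0 b0 #xF4) 4)
                               (t 0)))    ; not a valid lead octet
                    (code (and (plusp len)
                               (<= (+ i len) n)
                               (decode-sequence octets i len))))
               (cond (code
                      (write-char (code-char code) out)
                      (incf i len))
                     (t
                      (write-char (escape-octet b0) out)
                      (incf i)))))
    (get-output-stream-string out)))

;; Encode STRING as UTF-8, writing escaped characters back as the literal
;; octets they came from.
(defun encode-utf-8-lenient (string)
  (let ((out (make-array 0 :element-type '(unsigned-byte 8)
                           :adjustable t :fill-pointer 0)))
    (flet ((emit (octet) (vector-push-extend octet out)))
      (loop for ch across string
            for code = (char-code ch)
            do (cond ((<= #xDC80 code #xDCFF)
                      (emit (logand code #xFF)))
                     ((< code #x80)
                      (emit code))
                     ((< code #x800)
                      (emit (logior #xC0 (ash code -6)))
                      (emit (logior #x80 (logand code #x3F))))
                     ((< code #x10000)
                      (emit (logior #xE0 (ash code -12)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F))))
                     (t
                      (emit (logior #xF0 (ash code -18)))
                      (emit (logior #x80 (logand (ash code -12) #x3F)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F)))))))
    out))

;; "Hi", a stray #xFF, a valid e-acute (#xC3 #xA9), a stray continuation #x80.
(let ((input #(72 105 #xFF #xC3 #xA9 #x80)))
  (equalp input (encode-utf-8-lenient (decode-utf-8-lenient input))))
```

;;; With this approach the decoder never needs to signal an error or offer a
;;; restart, and the octets written out are identical to the octets read in
;;; unless user code deliberately changes the escaped characters.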
_______________________________________________
Ecls-list mailing list
Ecls-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ecls-list