On Thu, 23 May 2013 14:02:58 +0400
Stanislav Frolov <frolosof...@gmail.com> wrote:

> I have trouble with filename encoding on Linux (utf-8) and windows (cp866?).
> 
> Examples
> 
> There is one file in directory: "тест" (mean "test" in russian).
> (directory "*") => (#P"/path/to/тест")
> 
> Let's try create pathname from cyrilic utf-8 filename:
> (pathname "тест")
> Error: Cannot coerce string тест to a base-string

Unfortunately, the encoding of path and file names is OS-specific,
file-system-specific, and may be locale-specific...

POSIX filenames are plain byte sequences.  On filesystems which allow
it, those bytes are often used to hold UTF-8 characters, but that too is
only one of the available encoding options, and unfortunately filenames
cannot be tagged with the encoding type (except by using an uncommon
convention like the one RFC 2047 defines for message headers, or
non-portable attributes/subfiles).  As a result, files named by others
on their systems may not display correctly locally, even on the same OS
and FS.  However, because POSIX syscalls expect C strings, and UTF-8
never puts a NUL byte inside a name the way UTF-16 or UTF-32 would,
UTF-8 is popular where the various single-byte encodings are not used.
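
To make the byte-level view concrete, here is a minimal sketch in
portable Common Lisp that computes the UTF-8 octets a name would occupy
on disk.  UTF-8-OCTETS is a hypothetical helper name of mine, not part
of ECL, and the sketch assumes an implementation whose characters cover
Unicode and that the code points are valid (surrogates are not checked):

```lisp
;; Hypothetical helper (not an ECL function): hand-rolled UTF-8 encoding.
(defun utf-8-octets (string)
  "Return a list of the UTF-8 octets that encode STRING."
  (loop for ch across string
        for code = (char-code ch)
        append (cond ((< code #x80)             ; 1 octet: ASCII
                      (list code))
                     ((< code #x800)            ; 2 octets
                      (list (logior #xC0 (ash code -6))
                            (logior #x80 (logand code #x3F))))
                     ((< code #x10000)          ; 3 octets
                      (list (logior #xE0 (ash code -12))
                            (logior #x80 (logand (ash code -6) #x3F))
                            (logior #x80 (logand code #x3F))))
                     (t                         ; 4 octets
                      (list (logior #xF0 (ash code -18))
                            (logior #x80 (logand (ash code -12) #x3F))
                            (logior #x80 (logand (ash code -6) #x3F))
                            (logior #x80 (logand code #x3F)))))))

;; (utf-8-octets "тест") => (209 130 208 181 209 129 209 130)
```

Each of the four Cyrillic letters takes two octets, so the name is eight
bytes long as far as the filesystem is concerned.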

My Windows experience is limited, but I think that it usually uses
UTF-16 wherever Unicode strings are possible.

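ECL internally stores Unicode strings as 32-bit code points (UTF-32,
also known as UCS-4), whereas a base-string only accepts character
codes 0-255.  The Cyrillic letters in "тест" have codes above 255, which
is why the coercion to a base-string fails.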
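
A quick way to see whether a given string would survive that coercion
is to test each character against BASE-CHAR.  This is only a sketch;
BASE-STRING-COMPATIBLE-P is a hypothetical name, and the second call
assumes a Unicode-enabled build:

```lisp
;; Hypothetical helper: can STRING be stored in a base-string as-is?
(defun base-string-compatible-p (string)
  "True when every character of STRING is a BASE-CHAR."
  (every (lambda (ch) (typep ch 'base-char)) string))

(base-string-compatible-p "test")                      ; plain ASCII
(base-string-compatible-p (string (code-char #x442)))  ; CYRILLIC SMALL LETTER TE
```

On a Unicode-enabled ECL I would expect the first call to return T and
the second to return NIL.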


This might not be the only or the cleanest solution, but something like
the following might work to create UTF-8 pathnames for POSIX systems:


(defun utf-8-base-string<-string (string)
  "Encodes the supplied STRING to a UTF-8 base-string, which it returns."
  ;; Start with one element per character plus some slack.  That is the
  ;; best case (pure ASCII); the vector is grown in the loop as needed.
  (let ((v (make-array (+ 5 (length string))
                       :element-type 'base-char
                       :adjustable t
                       :fill-pointer 0)))
    (with-open-stream (s (ext:make-sequence-output-stream
                          v :external-format :utf-8))
      (loop
         for c across string
         do
           (write-char c s)
           ;; Keep at least 5 free elements so that the octets of the
           ;; next character (at most 4 in UTF-8) have room.
           (let ((d (array-dimension v 0)))
             (when (< (- d (fill-pointer v)) 5)
               (adjust-array v (* 2 d))))))
    v))

; (pathname (utf-8-base-string<-string "тест")) -> #P"Ñ\202ÐµÑ\201Ñ\202"


If you need more portable encoding conversion code, the Babel CL
library also supports this (http://common-lisp.net/project/babel/).
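
As a rough sketch of the Babel route, assuming Quicklisp is installed
and can fetch Babel: BABEL:STRING-TO-OCTETS does the encoding, and the
octets are then mapped to characters 0-255.  The function name below is
a hypothetical one of mine, and the MAP to BASE-STRING relies on ECL's
base-char covering codes 0-255, so it is not portable to implementations
with a narrower base-char:

```lisp
;; Assumption: Quicklisp is available.  Load this file with LOAD so that
;; the BABEL package exists before the DEFUN below is read.
(ql:quickload "babel")

;; Hypothetical helper name; only the BABEL: call comes from the library.
(defun utf-8-base-string-via-babel (string)
  "Encode STRING as UTF-8 with Babel; return the octets as a base-string."
  (map 'base-string #'code-char
       (babel:string-to-octets string :encoding :utf-8)))

;; Intended use, mirroring the function above:
;; (pathname (utf-8-base-string-via-babel "тест"))
```

BABEL:OCTETS-TO-STRING with the same :ENCODING argument goes the other
way, which may help when decoding names returned by DIRECTORY.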
-- 
Matt

_______________________________________________
Ecls-list mailing list
Ecls-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ecls-list
