Java uses UTF-16 encoding in memory for String objects.  A character in the
Basic Multilingual Plane (BMP) is stored as a single 16-bit char, but any
character outside the BMP is stored as a surrogate pair, a sequence of two
16-bit chars.  Clojure's \u<hex number> syntax takes exactly four hex digits,
so it can only directly represent a single 16-bit char.
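
As a rough illustration of the surrogate pair, here is a small sketch you can
paste into a REPL.  The names s, char-count, code-point-count and code-units
are just ones I made up for the example, and it assumes your source file or
REPL is read as UTF-8.

```clojure
;; U+20000 (the first character of CJK Unified Ideographs Extension B),
;; typed directly into a string literal.
(def s "𠀀")

;; Number of 16-bit chars (UTF-16 code units) in the string.
(def char-count (count s))

;; Number of Unicode code points in the string.
(def code-point-count (.codePointCount ^String s 0 (count s)))

;; The individual 16-bit chars, formatted as hex.
(def code-units (mapv #(format "%04x" (int %)) s))

(println char-count code-point-count code-units)
;; prints: 2 1 [d840 dc00]
```

So the one visible character occupies two chars, the high surrogate d840
followed by the low surrogate dc00.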

To represent a character outside the BMP, you can either write two \u<hex
number> escapes, doing the UTF-16 encoding yourself by hand, or call a Java
method: (Character/toChars 0x20000) returns a Java array of chars for
Unicode code point 0x20000, and (String. (Character/toChars 0x20000)) turns
that array into a string.
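
Here is a minimal sketch of both options, plus one way to write the regex
from your example.  The names by-hand, from-code-point and ext-b are
hypothetical, and the \x{...} escape is a feature of java.util.regex.Pattern
that needs Java 7 or later, so treat this as one possible approach rather
than the only one.

```clojure
(require '[clojure.string :as str])

;; Option 1: write the surrogate pair by hand with two \u escapes
;; inside a string literal.
(def by-hand "\ud840\udc00")

;; Option 2: let Java do the UTF-16 encoding from the code point.
(def from-code-point (String. (Character/toChars 0x20000)))

(println (= by-hand from-code-point))
;; prints: true

;; The code point can be read back with String.codePointAt.
(println (format "%x" (.codePointAt ^String from-code-point 0)))
;; prints: 20000

;; In a regex, java.util.regex.Pattern accepts \x{...} for a full code
;; point, so the Extension B range can be written without surrogates.
(def ext-b #"[\x{20000}-\x{2A6DF}]")

(println (str/split (str "a" from-code-point "a") ext-b))
;; prints: [a a]
```

With a pattern like this, a plain ASCII string such as "cabac" should come
back unsplit, since none of its characters fall in the range.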

Andy

On Sun, Aug 9, 2015 at 8:48 AM, 良ϖ <p.de.bois...@gmail.com> wrote:

> I've run into some trouble when parsing a Unicode character with
> Clojure. I know it's likely to be a problem related to Java and not
> Clojure itself, but I'm looking for a Clojurish solution, so that's why
> I'm posting it here. FYI, I run GNU/Linux, on top of which
> I use Emacs 24 in conjunction with CIDER 0.10.0snapshot (package:
> 20150710.1304), Java 1.8.0_51, Clojure 1.6.0 and nREPL 0.2.6.
>
> The first character of the Unicode block "CJK Unified Ideographs
> Extension B" is 𠀀 (hope you can properly read it, get a Chinese font
> otherwise). Emacs perfectly deals with it but in gedit, it's like this
> character would have the glyph you see (something like ㄛ but more
> angular) plus a negative space. In emacs it's displayed properly but
> when it comes to be evaluated, the behaviour is weird:
>
> ``` Clojure
> 華文.core> (clojure.string/split "a𠀀a" #"\𠀀")
> ; => ["a" "a"]
> 華文.core> (clojure.string/split "a𠀀a" #"\u20000")
> ; => ["a𠀀a"]
> 華文.core> (clojure.string/split "a𠀀a" #"[\u20000-\u2a6df]") ; it spans over Extension B
> ; => ["" "𠀀"]
> ```
>
> Moreover:
>
> ``` Clojure
> 華文.core> \u20000
> ; => IllegalArgumentException Invalid unicode character: \u20000
> clojure.lang.LispReader.readUnicodeChar
> 華文.core> (int \𠀀)
> ; => RuntimeException Unsupported character: \𠀀
> clojure.lang.Util.runtimeException (Util.java:221)
> 華文.core> (format "%04x" (int \u3403))
> ; => "3403"
> 華文.core> (format "%04x" (int \u20000))
> ; => IllegalArgumentException Invalid unicode character: \u20000
> clojure.lang.LispReader.readUnicodeChar
> ```
>
> Finally, here is a very annoying side effect, much like an overflow:
> from 20000 it overlaps values from 0, so the whole legacy ASCII range
> would be contained in this block.
>
> ``` Clojure
> 華文.core> (clojure.string/split "cabac" #"[\u20000-\u2a6df]")
> ; => []
> 華文.core> (clojure.string/split "cabac" #"[a-b]")
> ; => []
> ```
>
> Then I don't really know how I could handle this character. I've
> picked haphazardly some characters and it seems to be the same mess
> above \u9999 :/
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
