Java uses UTF-16 encoding in memory for String objects. Characters in the Basic Multilingual Plane (BMP) are represented as a single 16-bit character in memory, but anything outside the BMP is represented as a sequence of two 16-bit characters (a surrogate pair). Clojure's \u<hex number> syntax takes exactly four hex digits, so it can only directly represent a single 16-bit character.
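As a quick illustration of the point above, here is a small REPL sketch. It is not from the original exchange: the names `s` and `code-units-hex` are made up for the example, and it assumes the source is read as UTF-8 so the string literal survives intact.

```clojure
;; Illustration only: `s` and `code-units-hex` are hypothetical names.
;; U+20000 lies outside the BMP, so the JVM stores it as two 16-bit chars.
(def ^String s "𠀀")

(defn code-units-hex
  "Returns the 16-bit chars of string `text` as 4-digit hex strings."
  [text]
  (mapv #(format "%04x" (int %)) text))

(count s)                        ; => 2, length in 16-bit chars
(.codePointCount s 0 (count s))  ; => 1, length in Unicode code points
(code-units-hex s)               ; => ["d840" "dc00"], the surrogate pair
(format "%x" (.codePointAt s 0)) ; => "20000", the code point they encode
```

So `count` reports the number of 16-bit chars, not the number of characters a reader would see.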
To represent characters outside the BMP, you can either use two \u<hex number> sequences, doing the UTF-16 encoding yourself by hand, or you can use a Java function like (Character/toChars 0x20000) to get a Java array of characters for Unicode code point 0x20000, or (String. (Character/toChars 0x20000)) to get a string.

Andy

On Sun, Aug 9, 2015 at 8:48 AM, 良ϖ <p.de.bois...@gmail.com> wrote:

> I've run into some trouble when parsing a Unicode character with
> Clojure. I know it's likely to be a problem related to Java and not
> Clojure itself but I'm looking for a Clojurish solution so that's why
> I'm posting it here. FYI, I have a GNU / Linux OS on top of which
> I use emacs 24 in conjunction with CIDER 0.10.0snapshot (package:
> 20150710.1304), Java 1.8.0_51, Clojure 1.6.0 and nREPL 0.2.6.
>
> The first character of the Unicode block "CJK Unified Ideographs
> Extension B" is 𠀀 (hope you can properly read it, get a Chinese font
> otherwise). Emacs perfectly deals with it but in gedit, it's as if this
> character had the glyph you see (something like ㄛ but more
> angular) plus a negative space. In emacs it's displayed properly but
> when it comes to be evaluated, the behaviour is weird:
>
> ``` Clojure
> 華文.core> (clojure.string/split "a𠀀a" #"\𠀀")
> ; => ["a" "a"]
> 華文.core> (clojure.string/split "a𠀀a" #"\u20000")
> ; => ["a𠀀a"]
> 華文.core> (clojure.string/split "a𠀀a" #"[\u20000-\u2a6df]") ; it spans over Extension B
> ; => ["" "𠀀"]
> ```
>
> Moreover:
>
> ``` Clojure
> 華文.core> \u20000
> ; => IllegalArgumentException Invalid unicode character: \u20000 clojure.lang.LispReader.readUnicodeChar
> 華文.core> (int \𠀀)
> ; => RuntimeException Unsupported character: \𠀀 clojure.lang.Util.runtimeException (Util.java:221)
> 華文.core> (format "%04x" (int \u3403))
> ; => "3403"
> 華文.core> (format "%04x" (int \u20000))
> ; => IllegalArgumentException Invalid unicode character: \u20000 clojure.lang.LispReader.readUnicodeChar
> ```
>
> Finally here is a very annoying side-effect, just like an overflow:
> from 20000 it overlaps values from 0, so the whole legacy ASCII would
> be contained in this block.
>
> ``` Clojure
> 華文.core> (clojure.string/split "cabac" #"[\u20000-\u2a6df]")
> ; => []
> 華文.core> (clojure.string/split "cabac" #"[a-b]")
> ; => []
> ```
>
> Then I don't really know how I could handle this character. I've
> picked haphazardly some characters and it seems to be the same mess
> above \u9999 :/
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.