Oh, and I have no idea whether Java regular expressions let you specify ranges of characters outside the BMP. I would expect odd behavior in that area of Java's regular expression implementation, but I haven't done extensive testing myself to find out. I recommend that you not rely on any behavior there that you have not tested extensively yourself.
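For anyone who wants to test this rather than guess, here is a minimal sketch. It assumes a Java 7 or later JVM, where java.util.regex.Pattern documents the `\x{h...h}` escape for a full code point. Java's regex `\uhhhh` escape reads exactly four hex digits, so `#"\u20000"` means U+2000 followed by a literal `0`, which would explain the results quoted below. The var names (`ext-b-first`, `ext-b`, and so on) are only illustrative.

```clojure
(require '[clojure.string :as str])

;; U+20000 built from its code point, so no literal glyph is needed in source.
(def ext-b-first (String. (Character/toChars 0x20000)))

;; The same string with the UTF-16 surrogate pair written by hand.
(def by-hand "\uD840\uDC00")

;; Two 16-bit chars in memory, but a single Unicode code point.
(def char-count (count ext-b-first))
(def code-point-count (.codePointCount ext-b-first 0 (count ext-b-first)))

;; \x{...} takes a full code point, so this class covers only Extension B.
(def ext-b #"[\x{20000}-\x{2A6DF}]")

(def split-cjk   (str/split (str "a" ext-b-first "a") ext-b))
(def split-ascii (str/split "cabac" ext-b))

;; \uhhhh stops after four hex digits: this class is really
;; U+2000, the range 0..U+2A6D, and f, which swallows ASCII letters.
(def split-misread (str/split "cabac" #"[\u20000-\u2a6df]"))
```

On such a JVM I would expect `split-cjk` to be `["a" "a"]`, `split-ascii` to be `["cabac"]`, and `split-misread` to be `[]`, but please check on your own setup before relying on it.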
Andy

On Sun, Aug 9, 2015 at 9:22 AM, Andy Fingerhut <andy.finger...@gmail.com> wrote:

> Java uses UTF-16 encoding in memory for String objects. Characters in the
> Basic Multilingual Plane are represented as a single 16-bit character in
> memory, but anything outside the BMP is represented as a sequence of 2
> 16-bit characters. Clojure's \u<hex number> syntax can only be used to
> directly represent a 16-bit character.
>
> To represent characters outside the BMP, you can either use two \u<hex
> number> sequences, doing the UTF-16 encoding yourself by hand, or you can
> use a Java function like (Character/toChars 0x20000) to get a Java array of
> characters for Unicode code point 0x20000, or (String. (Character/toChars
> 0x20000)) to get a string.
>
> Andy
>
> On Sun, Aug 9, 2015 at 8:48 AM, 良ϖ <p.de.bois...@gmail.com> wrote:
>
>> I've run into some trouble when parsing a Unicode character with
>> Clojure. I know it's likely to be a problem related to Java and not
>> Clojure itself but I'm looking for a Clojurish solution so that's why
>> I'm posting it here. FYI, I have a GNU / Linux OS on top of which
>> I use emacs 24 in conjunction with CIDER 0.10.0snapshot (package:
>> 20150710.1304), Java 1.8.0_51, Clojure 1.6.0 and nREPL 0.2.6.
>>
>> The first character of the Unicode block "CJK Unified Ideographs
>> Extension B" is 𠀀 (hope you can properly read it, get a Chinese font
>> otherwise). Emacs perfectly deals with it but in gedit, it's like this
>> character would have the glyph you see (something like ㄛ but more
>> angular) plus a negative space.
>> In emacs it's displayed properly but
>> when it comes to be evaluated, the behaviour is weird:
>>
>> ``` Clojure
>> 華文.core> (clojure.string/split "a𠀀a" #"\𠀀")
>> ; => ["a" "a"]
>> 華文.core> (clojure.string/split "a𠀀a" #"\u20000")
>> ; => ["a𠀀a"]
>> 華文.core> (clojure.string/split "a𠀀a" #"[\u20000-\u2a6df]") ; it spans over Extension B
>> ; => ["" "𠀀"]
>> ```
>>
>> Moreover:
>>
>> ``` Clojure
>> 華文.core> \u20000
>> ; => IllegalArgumentException Invalid unicode character: \u20000
>> ;    clojure.lang.LispReader.readUnicodeChar
>> 華文.core> (int \𠀀)
>> ; => RuntimeException Unsupported character: \𠀀
>> ;    clojure.lang.Util.runtimeException (Util.java:221)
>> 華文.core> (format "%04x" (int \u3403))
>> ; => "3403"
>> 華文.core> (format "%04x" (int \u20000))
>> ; => IllegalArgumentException Invalid unicode character: \u20000
>> ;    clojure.lang.LispReader.readUnicodeChar
>> ```
>>
>> Finally here is a very annoying side-effect, just like an overflow:
>> from 20000 it overlaps values from 0, so the whole legacy ASCII would
>> be contained in this block.
>>
>> ``` Clojure
>> 華文.core> (clojure.string/split "cabac" #"[\u20000-\u2a6df]")
>> ; => []
>> 華文.core> (clojure.string/split "cabac" #"[a-b]")
>> ; => []
>> ```
>>
>> Then I don't really know how I could handle this character. I've
>> picked haphazardly some characters and it seems to be the same mess
>> above \u9999 :/
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clojure@googlegroups.com
>> Note that posts from new members are moderated - please be patient with
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+unsubscr...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to clojure+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>