Oh, and I have no idea whether Java regular expressions let you specify
ranges of characters outside the BMP.  I would expect odd behavior in that
area of Java's regular expression implementation, but I haven't tested it
extensively myself.  I recommend that you do not rely on any behavior
there that you have not tested extensively yourself.
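
For anyone who wants to run such a test, here is a minimal sketch.  It
assumes Java 7 or later, where java.util.regex.Pattern accepts the
\x{h...h} code point escape; the names ext-b and s are made up for
illustration.  Note that \u in a Java regex reads exactly four hex digits,
so \u20000 means U+2000 followed by a literal 0, which would explain the
results in the original post.

```clojure
(require '[clojure.string :as str])

;; Assumption: Java 7+, so Pattern understands \x{...} code point escapes.
;; The range covers CJK Unified Ideographs Extension B (U+20000..U+2A6DF).
(def ext-b #"[\x{20000}-\x{2A6DF}]")

;; Build "a𠀀a" without relying on how the editor encodes the source file.
(def s (str "a" (String. (Character/toChars 0x20000)) "a"))

(str/split s ext-b)        ; => ["a" "a"]
(str/split "cabac" ext-b)  ; => ["cabac"], ASCII is not inside the range
(re-find ext-b "abc")      ; => nil
```

On the Java 8 setup described in the original post I would expect the
results shown in the comments, but do verify them on your own JVM.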

Andy

On Sun, Aug 9, 2015 at 9:22 AM, Andy Fingerhut <andy.finger...@gmail.com>
wrote:

> Java uses UTF-16 encoding in memory for String objects.  Characters in the
> Basic Multilingual Plane are represented as a single 16-bit char in
> memory, but anything outside the BMP is represented as a sequence of two
> 16-bit chars (a surrogate pair).  Clojure's \u<hex number> syntax can only
> be used to directly represent a single 16-bit char.
>
> To represent characters outside the BMP, you can either use two \u<hex
> number> sequences, doing the UTF-16 encoding yourself by hand, or you can
> use a Java function like (Character/toChars 0x20000) to get a Java array of
> characters for Unicode code point 0x20000, or (String. (Character/toChars
> 0x20000)) to get a string.
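>
> A minimal sketch of both approaches, to be tried at a REPL.  The names cp,
> s1 and s2 are made up for illustration.  The by-hand version uses a string
> literal because, as far as I know, Clojure's reader rejects a lone
> surrogate written as a character literal.
>
> ```clojure
> ;; Code point U+20000, the first character of CJK Unified Ideographs
> ;; Extension B.
> (def cp 0x20000)
>
> ;; Option 1: let Java do the UTF-16 encoding.
> (def s1 (String. (Character/toChars (int cp))))
>
> ;; Option 2: write the surrogate pair by hand.
> ;; 0x20000 - 0x10000 = 0x10000; the top 10 bits give 0xD800 + 0x40 = 0xD840,
> ;; the low 10 bits give 0xDC00 + 0x000 = 0xDC00.
> (def s2 "\uD840\uDC00")
>
> (= s1 s2)                          ; => true
> (count s1)                         ; => 2, counted in 16-bit chars
> (.codePointCount s1 0 (count s1))  ; => 1, counted in Unicode code points
> (format "%x" (.codePointAt s1 0))  ; => "20000"
> ```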
>
> Andy
>
> On Sun, Aug 9, 2015 at 8:48 AM, 良ϖ <p.de.bois...@gmail.com> wrote:
>
>> I've run into some trouble parsing a Unicode character with Clojure.
>> I know it's likely a problem with Java rather than Clojure itself, but
>> I'm looking for a Clojurish solution, which is why I'm posting it
>> here. FYI, I'm on GNU/Linux, using Emacs 24 in conjunction with CIDER
>> 0.10.0snapshot (package: 20150710.1304), Java 1.8.0_51, Clojure 1.6.0
>> and nREPL 0.2.6.
>>
>> The first character of the Unicode block "CJK Unified Ideographs
>> Extension B" is 𠀀 (I hope you can read it properly; otherwise, get a
>> Chinese font). In gedit, the character shows up as its glyph
>> (something like ㄛ but more angular) followed by a negative space.
>> Emacs displays it properly, but when it is evaluated, the behaviour
>> is weird:
>>
>> ``` Clojure
>> 華文.core> (clojure.string/split "a𠀀a" #"\𠀀")
>> ; => ["a" "a"]
>> 華文.core> (clojure.string/split "a𠀀a" #"\u20000")
>> ; => ["a𠀀a"]
>> 華文.core> (clojure.string/split "a𠀀a" #"[\u20000-\u2a6df]") ; it spans over Extension B
>> ; => ["" "𠀀"]
>> ```
>>
>> Moreover:
>>
>> ``` Clojure
>> 華文.core> \u20000
>> ; => IllegalArgumentException Invalid unicode character: \u20000
>> ;    clojure.lang.LispReader.readUnicodeChar
>> 華文.core> (int \𠀀)
>> ; => RuntimeException Unsupported character: \𠀀
>> ;    clojure.lang.Util.runtimeException (Util.java:221)
>> 華文.core> (format "%04x" (int \u3403))
>> ; => "3403"
>> 華文.core> (format "%04x" (int \u20000))
>> ; => IllegalArgumentException Invalid unicode character: \u20000
>> ;    clojure.lang.LispReader.readUnicodeChar
>> ```
>>
>> Finally, here is a very annoying side effect that looks like an
>> overflow: from 20000 the range wraps around to values from 0, so the
>> whole of legacy ASCII would be contained in this block.
>>
>> ``` Clojure
>> 華文.core> (clojure.string/split "cabac" #"[\u20000-\u2a6df]")
>> ; => []
>> 華文.core> (clojure.string/split "cabac" #"[a-c]")
>> ; => []
>> ```
>>
>> So I don't really know how to handle this character. I've picked a
>> few characters at random, and it seems to be the same mess above
>> \u9999 :/
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clojure@googlegroups.com
>> Note that posts from new members are moderated - please be patient with
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+unsubscr...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to clojure+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
