Thanks for your reply, Bertini!

I'm also a Factor newbie, much newer than you (I began to use Factor
yesterday, in fact) :)
I ran into a Unicode issue recently, and that is why I want to try
the *unicode.normalize* vocabulary.

The issue is that when I submit a form containing the value
"𝅘𝅥𝅮"(U+1D160), different browsers encode it differently.
IE8 (Windows 7, 64bit) encodes it as "%F0%9D%85%A0" (U+1D160 encoded
as UTF-8, then each octet formatted with *sprintf "%%%02X"*).
Chrome (Ubuntu 12.04LTS, 64bit) encodes it as
"%F0%9D%85%98%F0%9D%85%A5%F0%9D%85%AE" (U+1D158, U+1D165 and U+1D16E
encoded as UTF-8, then each octet *percent*-encoded).
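
As a rough check, both encodings can be reproduced with a few lines of
Python 3. This is only a sketch: the variable names are mine, and
*urllib.parse.quote* stands in for whatever the browsers do internally.

```python
from urllib.parse import quote

EIGHTH_NOTE = "\U0001D160"                     # precomposed eighth note
COMPONENTS = "\U0001D158\U0001D165\U0001D16E"  # notehead, stem, flag

# quote() encodes the text as UTF-8 and percent-escapes each octet.
ie8_style = quote(EIGHTH_NOTE)
chrome_style = quote(COMPONENTS)

print(ie8_style)     # %F0%9D%85%A0
print(chrome_style)  # %F0%9D%85%98%F0%9D%85%A5%F0%9D%85%AE
```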

These three characters "𝅘"(U+1D158), "𝅥"(U+1D165) and "𝅮"(U+1D16E)
seem to be related to "𝅘𝅥𝅮"(U+1D160):
* They are the components of the latter.
* They are rendered as one unit when they appear consecutively (i.e. "𝅘𝅥𝅮").
In IE8 they are rendered as three *holes* (perhaps because a suitable
font is missing), and I can't select an individual hole. The only
thing I can do is select all three holes at once.
In Chrome and Firefox the sequence looks like the ordinary
"𝅘𝅥𝅮"(U+1D160), but I can't select an individual character there either.
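
The "components" relationship can be inspected with Python 3's
*unicodedata* module. A minimal sketch, where *canonical_parts* is a
helper name I made up; it returns the one-step canonical decomposition
recorded in the Unicode Character Database.

```python
import unicodedata

def canonical_parts(ch):
    """Return the one-step canonical decomposition of ch as code points."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>"; skip those.
    if not mapping or mapping.startswith("<"):
        return []
    return [int(cp, 16) for cp in mapping.split()]

eighth = canonical_parts("\U0001D160")   # decomposes to U+1D15F U+1D16E
quarter = canonical_parts("\U0001D15F")  # decomposes to U+1D158 U+1D165

for cp in eighth + quarter:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

So U+1D160 decomposes in two steps: first into U+1D15F plus U+1D16E,
and then U+1D15F into U+1D158 plus U+1D165.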

Typing them in Emacs is interesting.
The first character "𝅘"(U+1D158) occupies one character width.
After I input the second one, "𝅘𝅥"(U+1D158 and U+1D165), the caret
doesn't move, and the two look like an ordinary "𝅘𝅥"(U+1D15F) occupying
one character width.
After the final one, "𝅘𝅥𝅮"(U+1D158, U+1D165 and U+1D16E), the
situation is the same: it looks like "𝅘𝅥𝅮"(U+1D160) and occupies one
character width.
The difference between Emacs and Chrome is that in Emacs I can select
the individual characters (whether I type them in Emacs directly or
copy and paste them from Chrome).

My provisional conclusion is that it has something to do with
normalization, so I have tried some libraries (two in Clojure, two in
Haskell and *unicode.normalize* in Factor).
All of them can compose letters and accents, e.g. "n"(U+6E) and
"̃"(U+303) to "ñ"(U+F1), and decompose letters with accents.
All of them can decompose "𝅘𝅥𝅮"(U+1D160) to "𝅘"(U+1D158),
"𝅥"(U+1D165) and "𝅮"(U+1D16E), but can't compose them back.
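
The same behaviour can be sketched in Python 3 with
*unicodedata.normalize*. As far as I understand, the precomposed
musical symbols U+1D15E..U+1D164 are listed in
CompositionExclusions.txt, so NFC is not supposed to recompose them.
That would make this expected behaviour rather than a library bug.

```python
import unicodedata

eighth_note = "\U0001D160"
nfd = unicodedata.normalize("NFD", eighth_note)
nfc = unicodedata.normalize("NFC", eighth_note)

# Both forms come out as the same three-code-point sequence.
print([f"U+{ord(c):04X}" for c in nfd])
print([f"U+{ord(c):04X}" for c in nfc])

# A letter plus a combining accent still composes as usual under NFC.
composed = unicodedata.normalize("NFC", "n\u0303")
print(composed == "\u00F1")
```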

The next library I plan to try is Python's unicodedata
(http://docs.python.org/2/library/unicodedata.html).
As a last resort, if that doesn't work, I will have to parse
(http://www.unicode.org/Public/UNIDATA/UnicodeData.txt) manually.
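
A rough Python 3 sketch of what the manual approach could look like.
The function names are hypothetical, the two SAMPLE lines are written
in the UnicodeData.txt field layout and assumed to match the real
entries, and the composer is a simple greedy pairwise pass that
deliberately ignores composition exclusions. It is not the full
canonical composition algorithm.

```python
def parse_canonical_decompositions(lines):
    """Map code point -> tuple of code points for canonical decompositions."""
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 6 or not fields[5]:
            continue
        # Field 5 is the decomposition; a leading <tag> marks a
        # compatibility mapping, which is skipped here.
        if fields[5].startswith("<"):
            continue
        parts = tuple(int(cp, 16) for cp in fields[5].split())
        table[int(fields[0], 16)] = parts
    return table

def compose_pairs(text, table):
    """Greedy pairwise composition that ignores composition exclusions."""
    pairs = {parts: cp for cp, parts in table.items() if len(parts) == 2}
    out = []
    for ch in text:
        key = (ord(out[-1]), ord(ch)) if out else None
        if key in pairs:
            out[-1] = chr(pairs[key])  # merge with the previous character
        else:
            out.append(ch)
    return "".join(out)

# Assumed sample entries; the real input would be the downloaded file.
SAMPLE = [
    "1D15F;MUSICAL SYMBOL QUARTER NOTE;So;0;L;1D158 1D165;;;;N;;;;;",
    "1D160;MUSICAL SYMBOL EIGHTH NOTE;So;0;L;1D15F 1D16E;;;;N;;;;;",
]
table = parse_canonical_decompositions(SAMPLE)
result = compose_pairs("\U0001D158\U0001D165\U0001D16E", table)
print(f"U+{ord(result):04X}")  # U+1D160
```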

Another interesting thing is that Chrome doesn't decompose letters with
accents when it encodes them.
Maybe I am going in the wrong direction? Could you give me some advice?
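
One possible direction, sketched in Python 3 under the assumption that
the goal is only to treat both browsers' submissions as the same value:
percent-decode the form value and normalize it to one form (NFC here)
before comparing or storing it. *normalized_form_value* is a
hypothetical helper name.

```python
import unicodedata
from urllib.parse import unquote

def normalized_form_value(raw):
    """Percent-decode a form value as UTF-8 and normalize it to NFC."""
    return unicodedata.normalize("NFC", unquote(raw))

from_ie8 = normalized_form_value("%F0%9D%85%A0")
from_chrome = normalized_form_value("%F0%9D%85%98%F0%9D%85%A5%F0%9D%85%AE")
print(from_ie8 == from_chrome)  # True
```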

Thanks again!

_______________________________________________
Factor-talk mailing list
Factor-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/factor-talk