Look at
http://www.jsoftware.com/jwiki/Scripts/Ufread
2007/4/27, Dan Bron <[EMAIL PROTECTED]>:
How can I use rxmatches_jregex_ to match against unicode? I'm ignorant
(more or less) of Unicode and its various representations.
What I have is a bunch of files that appear to use two bytes to encode
every character. The first byte is always 0. One (only one) of the files
begins with the byte sequence 255 254 (which I take to be a BOM).
My goal is to use rxmatches on the files, transform the matches using a J
verb, and write the transformed results back (in whatever the native
encoding of these files is). How can I do that? Right now, if I
run rxmatches I get no results. If I run rxmatches@:-.&({.a.) I get the
expected results, but then the transformed results will be in ASCII, not
this 0,ASCII pair thing.
Of course, I could use rxmatches&.( -.&({.a.) :. ([: , ({.a.)&,.) ) but
that seems inelegant and if ever there's a character that
isn't 0,ASCII it'll be wrong. In fact, it'll mess up right off the bat,
on the BOM.
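
[Editorial sketch, not part of the original message.] The failure described above can be illustrated outside J. The Python below is only an illustration of the byte-level behaviour. It assumes the files are little-endian UTF-16, as the BOM 255 254 suggests, so the zero byte is re-inserted after each byte rather than before it as in the J phrase. The sample text is hypothetical.

```python
# Hypothetical sample: BOM (255 254), some ASCII, and one character
# whose UTF-16 encoding contains no zero byte (the euro sign, U+20AC).
raw = b"\xff\xfe" + "id=€".encode("utf-16-le")

# The strip-and-reinsert idea: drop every 0 byte, then put a 0 byte
# after each remaining byte (little-endian order is an assumption).
stripped = raw.replace(b"\x00", b"")
rebuilt = b"".join(bytes([b, 0]) for b in stripped)

# Each BOM byte and each byte of the euro sign gains a 0 byte,
# so the rebuilt bytes no longer equal the input.
print(rebuilt == raw)

# Decoding and re-encoding through a UTF-16 codec avoids that:
# the BOM is consumed on decode and written back explicitly.
text = raw.decode("utf-16")
roundtrip = b"\xff\xfe" + text.encode("utf-16-le")
print(roundtrip == raw)
```

The point of the sketch is that a real decode and encode step round-trips the BOM and any non-ASCII characters, while filtering zero bytes does not.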
Also, I'm pretty sure PCRE supports unicode natively (and efficiently), so
I'd like to leverage that, if I can, and make my code ignorant of the fact
that these files aren't ASCII (the parts I'm matching against are GUIDs and
so only contain the characters [0-9a-fA-F]).
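
[Editorial sketch, not part of the original message.] One way to keep the matching code ignorant of the file encoding is to decode once, match and transform on text, and encode once on the way out. The Python below sketches that shape under several assumptions that do not come from the message: the files are little-endian UTF-16 with or without a BOM, the GUIDs use the common hyphenated 8-4-4-4-12 layout, and `transform_guids` is a hypothetical helper name. It does not use rxmatches or PCRE.

```python
import re

# Assumed GUID layout: 8-4-4-4-12 hex digits separated by hyphens.
GUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")

BOM_LE = b"\xff\xfe"  # the byte sequence 255 254


def transform_guids(raw: bytes, fn) -> bytes:
    """Apply fn to every GUID in UTF-16LE bytes, preserving the BOM if present."""
    has_bom = raw.startswith(BOM_LE)
    body = raw[len(BOM_LE):] if has_bom else raw
    text = body.decode("utf-16-le")            # bytes -> text, once
    new_text = GUID.sub(lambda m: fn(m.group(0)), text)
    out = new_text.encode("utf-16-le")         # text -> bytes, once
    return BOM_LE + out if has_bom else out


# Hypothetical input: one line containing a GUID, with a BOM.
sample = BOM_LE + "key={3f2504e0-4f89-11d3-9a0c-0305e82c3301}".encode("utf-16-le")
result = transform_guids(sample, str.upper)
print(result.decode("utf-16"))
```

Only the two codec calls know about the encoding. The pattern and the transforming function see ordinary text, which is the separation the message asks for.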
So, could someone give me guidance on matching against unicode? I know
about rxutf8 but it doesn't appear to help. I'll be happy to use u: if
necessary, but the "obvious" approach, &.(1&u:@:(6&u:)) doesn't work
because apparently 6&u: is not invertible.
-Dan
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
--
Björn Helgason, Engineer
Fugl&Fiskur ehf, Þerneyjarsund 23, Box 127
801 Grímsnes, email: [EMAIL PROTECTED]
Skype: gosiminn, gsm: +3546985532
Landscaping and ornamental gardening, excavator services
http://groups.google.com/group/J-Programming
Technical skill handles the complex; creativity is the master of simplicity
A good teacher can step on toes without taking the shine off the shoes
With a light heart, today will be even better than yesterday