IIRC, PCRE only supports Unicode as UTF-8, so you must convert the text to UTF-8
before using rxmatches_jregex_.
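
Something along these lines might do it. This is a rough sketch, not tested code: it assumes the files are little-endian UTF-16 read on a little-endian machine (so that 6 u: takes the byte pairs in file order) and that the regex script is already loaded. The names toutf8, toutf16le, asutf8 and transform, and the file name, are made up for illustration.

```j
NB. Rough sketch, not tested on every J release.
NB. Assumes little-endian UTF-16 files read on a little-endian machine,
NB. and that the regex script is already loaded.
NB. toutf8, toutf16le, asutf8 and transform are made-up names.

toutf8    =: 8&u: @: (6&u:)        NB. byte pairs -> 2-byte chars -> UTF-8 bytes
NB. UTF-8 bytes -> 16-bit code units -> (low,high) byte pairs
toutf16le =: a. {~ [: , [: |."1 (2#256) #: 3 u: 7 u: ]
asutf8    =: toutf8 :. toutf16le   NB. attach the inverse so &.: can undo it

raw  =: fread 'sample.txt'                     NB. made-up file name
hits =: '[0-9a-fA-F]{8}' rxmatches asutf8 raw  NB. match on the UTF-8 text
NB. new =: transform &.: asutf8 raw            NB. transform: your verb on UTF-8 text
```

Since 6&u: has no inverse of its own, the sketch supplies one with :. and spells out the byte order by hand. The BOM is treated as an ordinary character, so it should survive the round trip in the one file that has it.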
Dan Bron wrote:
How can I use rxmatches_jregex_ to match against unicode? I'm ignorant (more
or less) of Unicode and its various representations.
What I have is a bunch of files that appear to use two bytes to encode every character. The first byte is always 0. One (only one) of the files begins with the byte sequence 255 254 (which I take to be a BOM).
My goal is to use rxmatches on the files, transform the matches using a J verb, and write the transformed results back (in whatever the native encoding of these files is). How can I do that? Right now, if I run rxmatches I get no results. If I run rxmatches@:-.&({.a.) I get the expected results, but then the transformed results will be in ASCII, not this 0,ASCII pair thing.
Of course, I could use rxmatches&.( -.&({.a.) :. ([: , ({.a.)&,.) ) but that seems inelegant and if ever there's a character that isn't 0,ASCII it'll be wrong. In fact, it'll mess up right off the bat, on the BOM.
Also, I'm pretty sure PCRE supports unicode natively (and efficiently), so I'd
like to leverage that, if I can, and make my code ignorant of the fact that
these files aren't ASCII (the parts I'm matching against are GUIDs and so only
contain the characters [0-9a-fA-F]).
So, could someone give me guidance on matching against unicode? I know about rxutf8 but it doesn't
appear to help. I'll be happy to use u: if necessary, but the "obvious" approach,
&.(1&u:@:(6&u:)) doesn't work because apparently 6&u: is not invertible.
-Dan
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
--
regards,
bill