Re: [Jprogramming] regex and unicode

bill lam Fri, 27 Apr 2007 08:53:43 -0700

iirc pcre only supports unicode in utf8, so you must convert unicode to utf8 before using rxmatches_jregex_ .
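
To illustrate that round trip (decode the two-byte data, match against utf8, re-encode in the file's original form), here is a minimal sketch. It is in Python rather than J, and it uses Python's re module on utf8 bytes as a stand-in for pcre and rxmatches. The names GUID, transform_utf16 and default_encoding are made up for the example, the GUID pattern assumes the usual 8-4-4-4-12 hex layout, and the byte order used for files without a BOM is a guess that may need to be changed.

```python
import re

# Hypothetical pattern: a GUID in the common 8-4-4-4-12 hex layout.
GUID = re.compile(rb"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def transform_utf16(raw, fn, default_encoding="utf-16-le"):
    """Apply fn to every GUID match in UTF-16 file data and return the
    result in the same encoding, keeping the BOM if there was one."""
    # 255 254 is the little-endian BOM, 254 255 the big-endian one.
    if raw.startswith(b"\xff\xfe"):
        bom, encoding = raw[:2], "utf-16-le"
    elif raw.startswith(b"\xfe\xff"):
        bom, encoding = raw[:2], "utf-16-be"
    else:
        # No BOM: the byte order is an assumption supplied by the caller.
        bom, encoding = b"", default_encoding

    # Two-byte characters -> utf8, which is what the regex engine sees.
    utf8 = raw[len(bom):].decode(encoding).encode("utf-8")

    # Match and transform on the utf8 bytes; fn maps bytes to bytes.
    changed = GUID.sub(lambda m: fn(m.group(0)), utf8)

    # utf8 -> original encoding, with the BOM put back in front.
    return bom + changed.decode("utf-8").encode(encoding)
```

For example, transform_utf16(data, bytes.upper) would upper-case every GUID and leave everything else, including non-ASCII characters and the BOM, as it was. In J the conversions would presumably go through u: (6&u: and 8&u: on the way in), but that part is not shown here.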


Dan Bron wrote:

How can I use  rxmatches_jregex_  to match against unicode?  I'm ignorant (more
or less) of Unicode and its various representations.

What I have is a bunch of files that appear to use two bytes to encode every character. The first byte is always 0. One (only one) of the files begins with the byte sequence 255 254 (which I take to be a BOM).

My goal is to use rxmatches on the files, transform the matches using a J verb, and write the transformed results back (in whatever the native encoding of these files is). How can I do that?

Right now, if I run rxmatches I get no results. If I run rxmatches@:-.&({.a.) I get the expected results, but then the transformed results will be in ASCII, not this 0,ASCII pair thing.

Of course, I could use rxmatches&.( -.&({.a.) :. ([: , ({.a.)&,.) ) but that seems inelegant, and if ever there's a character that isn't 0,ASCII it'll be wrong. In fact, it'll mess up right off the bat, on the BOM.

Also, I'm pretty sure PCRE supports unicode natively (and efficiently), so I'd
like to leverage that, if I can, and make my code ignorant of the fact that
these files aren't ASCII (the parts I'm matching against are GUIDs and so only
contain the characters [0-9a-fA-F]).

So, could someone give me guidance on matching against unicode?  I know about  rxutf8  but it doesn't
appear to help.  I'll be happy to use  u:  if necessary, but the "obvious" approach,
&.(1&u:@:(6&u:))  doesn't work because apparently  6&u:  is not invertible.

-Dan
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm



--
regards,
bill