[Jprogramming] regex and unicode

Dan Bron Fri, 27 Apr 2007 08:23:43 -0700

How can I use  rxmatches_jregex_  to match against unicode?  I'm ignorant (more 
or less) of Unicode and its various representations.


What I have is a bunch of files that appear to use two bytes to encode every 
character.  The first byte is always 0.  One (only one) of the files begins 
with the byte sequence  255 254  (which I take to be a BOM).  

My goal is to use rxmatches on the files, transform the matches using a J verb, 
and write the transformed results back (in whatever the native encoding of 
these files is).  How can I do that?    Right now, if I run  rxmatches  I get 
no results. If I run  rxmatches@:-.&({.a.)  I get the expected results, but 
then the transformed results will be in ASCII, not this  0,ASCII  pair thing.  

Of course, I could use  rxmatches&.( -.&({.a.) :. ([: , ({.a.)&,.) )  but that 
seems inelegant and if ever there's a character that isn't  0,ASCII  it'll be 
wrong.  In fact, it'll mess up right off the bat, on the BOM.  
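
To make the problem concrete, here is a minimal sketch of what that workaround 
does to the bytes.  The names  nul, strip, pad, bytes  and  bom  are made up for 
illustration, and the sample data is invented rather than taken from the files.  
It assumes the  0,ASCII  layout described above.

```j
NB. Illustration only: hypothetical names, invented sample data.
nul   =: {. a.                  NB. the NUL byte
strip =: -.&nul                 NB. drop every NUL byte
pad   =: [: , nul&,.            NB. put a NUL back in front of each byte

bytes =: pad 'abc'              NB. 'abc' stored as (0, ASCII) pairs, 6 bytes

(strip bytes) -: 'abc'          NB. 1: the pairs collapse to plain ASCII
(pad strip bytes) -: bytes      NB. 1: round trip holds for pure (0, ASCII) data

bom   =: 255 254 { a.           NB. the two bytes that start one of the files
(pad strip bom) -: bom          NB. 0: the BOM comes back as 4 bytes, not 2
```

So the round trip only holds while every character really is a  0,ASCII  pair; 
anything else, starting with the BOM, is silently changed on the way back.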

Also, I'm pretty sure PCRE supports unicode natively (and efficiently), so I'd 
like to leverage that, if I can, and make my code ignorant of the fact that 
these files aren't ASCII (the parts I'm matching against are GUIDs and so only 
contain the characters [0-9a-fA-F]).

So, could someone give me guidance on matching against unicode?  I know about  
rxutf8  but it doesn't appear to help.  I'll be happy to use  u:  if necessary, 
but the "obvious" approach,  &.(1&u:@:(6&u:))  doesn't work because apparently  
6&u:  is not invertible.

-Dan
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm