On 27 Apr 2007 at 11:13, Dan Bron said:

> How can I use  rxmatches_jregex_  to match against unicode?  I'm ignorant
> (more or less) of Unicode and its various representations.
> 
> What I have is a bunch of files that appear to use two bytes to encode
> every character.  The first byte is always 0.  One (only one) of the files
> begins with the byte sequence  255 254  (which I take to be a BOM).  

Are you sure about that?

If the first byte is always 0, the data would appear to be in UTF-16-BE 
format: each pair of bytes represents one Unicode character, in "big-
endian" format.

However, the 255 254 or hex FF,FE is the "little-endian" order for BOM, 
which is U+FEFF.
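
One way to check, as a sketch (the file name is a placeholder of mine), is to look at the first few bytes of a file as numbers:

   a. i. 8 {. 1!:1 <'somefile.txt'

A little-endian file would start 255 254 and then show each ASCII code followed by 0; a big-endian one would start 254 255 and show a 0 before each ASCII code.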

> My goal is to use rxmatches on the files, transform the matches using a J
> verb, and write the transformed results back (in whatever the native
> encoding of these files is).  How can I do that?    Right now, if I run 
> rxmatches  I get no results. If I run  rxmatches@:-.&({.a.)  I get the
> expected results, but then the transformed results will be in ASCII, not
> this  0,ASCII  pair thing.  
> 
> Of course, I could use  rxmatches&.( -.&({.a.) :. ([: , ({.a.)&,.) )  but
> that seems inelegant and if ever there's a character that isn't  0,ASCII 
> it'll be wrong.  In fact, it'll mess up right off the bat, on the BOM.  
> 
> Also, I'm pretty sure PCRE supports unicode natively (and efficiently), so
> I'd like to leverage that, if I can, and make my code ignorant of the fact
> that these files aren't ASCII (the parts I'm matching against are GUIDs
> and so only contain the characters [0-9a-fA-F]).
> 
> So, could someone give me guidance on matching against unicode?  I know
> about  rxutf8  but it doesn't appear to help.  I'll be happy to use  u:  if
> necessary, but the "obvious" approach,  &.(1&u:@:(6&u:))  doesn't work
> because apparently  6&u:  is not invertible.

J doesn't seem to support creation of UTF-16 data directly. 6&u: should 
indeed convert your pairs of bytes to Unicode characters, though it 
appears to expect them in little-endian order (i.e. if you're correct 
about them being big-endian, you'll have to swap the pairs).
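
A sketch of the swap, assuming the data really is big-endian (swap and bytes are names I've made up for illustration; bytes stands in for the contents of one of your files):

swap =. [: , _2 |.\ ]     NB. reverse each adjacent pair of bytes

   bytes =. 0 65 0 66 { a.     NB. 'AB' as big-endian byte pairs
   a. i. swap bytes
65 0 66 0
   wide =. 6 u: swap bytes     NB. should now be the wide-character string 'AB'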

More or less any application that reads Unicode should be able to cope 
with UTF-8, so you should probably at least try that for output.
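
For instance, a sketch using 8&u: to get UTF-8 bytes from wide characters (wide is a made-up name, built here from code points purely for illustration):

   wide =. 4 u: 72 105 233     NB. 'Hi' followed by e-acute, as wide characters
   a. i. 8 u: wide
72 105 195 169

The result of 8&u: is an ordinary byte string, so it can be written to a file as it stands; 7&u: goes the other way.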

If it doesn't work, then 3&u: will convert your Unicode wchars to 
numbers, which you can then split into bytes with 256&#.^:_1 (or, more 
safely, 256 256&#: which always yields two bytes per character) before 
using the result to select bytes from a. The bytes come out big-endian 
(high byte first), so swap each pair if you need little-endian before 
writing out.

These two verbs should take a string of Unicode widechars and emit bytes 
in UTF-16.

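NB. 256 256 #: always gives two bytes per character, high byte first
NB. (256 #.^:_1 gives only one byte each when every code is below 256)
utf16be =. [: , a. {~ 256 256 #: 3 u: ]
utf16le =. [: , [: |."1 a. {~ 256 256 #: 3 u: ]

   a. i. utf16le 4 u: 16bfeff
255 254
   a. i. utf16be 4 u: 16bfeff
254 255
   a. i. utf16be 4 u: 65 66
0 65 0 66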

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
