On 27 Apr 2007 at 11:13, Dan Bron said:
> How can I use rxmatches_jregex_ to match against unicode? I'm ignorant
> (more or less) of Unicode and its various representations.
>
> What I have is a bunch of files that appear to use two bytes to encode
> every character. The first byte is always 0. One (only one) of the files
> begins with the byte sequence 255 254 (which I take to be a BOM).
Are you sure about that?
If the first byte of each pair is always 0, the data would appear to be
in UTF-16BE format: each pair of bytes represents one Unicode character,
most significant byte first ("big-endian").
However, 255 254 (hex FF FE) is the "little-endian" byte order for the
BOM, which is U+FEFF, so that one file at least looks like UTF-16LE.
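One way to settle the question is to look at the raw bytes. The following is only a sketch, in Python rather than J so that it can be run anywhere; guess_utf16_order is a made-up name, and the zero-byte count is a heuristic that assumes mostly-ASCII text.

```python
import codecs


def guess_utf16_order(data: bytes) -> str:
    """Return 'le', 'be' or 'unknown' for bytes believed to be UTF-16."""
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "be"
    # No BOM. Heuristic (assumption: mostly-ASCII text): the zero byte
    # of each pair is the high byte, so see which position holds it.
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    if even_zeros > odd_zeros:
        return "be"
    if odd_zeros > even_zeros:
        return "le"
    return "unknown"


print(guess_utf16_order(bytes([255, 254, 65, 0])))  # le (BOM FF FE)
print(guess_utf16_order(bytes([0, 65, 0, 66])))     # be (zero byte first)
```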
> My goal is to use rxmatches on the files, transform the matches using a J
> verb, and write the transformed results back (in whatever the native
> encoding of these files is). How can I do that? Right now, if I run
> rxmatches I get no results. If I run rxmatches@:-.&({.a.) I get the
> expected results, but then the transformed results will be in ASCII, not
> this 0,ASCII pair thing.
>
> Of course, I could use rxmatches&.( -.&({.a.) :. ([: , ({.a.)&,.) ) but
> that seems inelegant and if ever there's a character that isn't 0,ASCII
> it'll be wrong. In fact, it'll mess up right off the bat, on the BOM.
>
> Also, I'm pretty sure PCRE supports unicode natively (and efficiently), so
> I'd like to leverage that, if I can, and make my code ignorant of the fact
> that these files aren't ASCII (the parts I'm matching against are GUIDs
> and so only contain the characters [0-9a-fA-F]).
>
> So, could someone give me guidance on matching against unicode? I know
> about rxutf8 but it doesn't appear to help. I'll be happy to use u: if
> necessary, but the "obvious" approach, &.(1&u:@:(6&u:)) doesn't work
> because apparently 6&u: is not invertible.
J doesn't seem to support creation of UTF-16 data directly. 6&u: should
indeed convert your pairs of bytes to Unicode characters, though it
appears to expect them in little-endian order (i.e. if you're correct
about them being big-endian, you'll have to swap the pairs).
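Swapping the pairs is just a byte shuffle. A minimal sketch of the operation, again in Python rather than J, with swap_pairs as a hypothetical helper name:

```python
def swap_pairs(data: bytes) -> bytes:
    """Exchange the two bytes of every 16-bit unit (BE <-> LE)."""
    if len(data) % 2:
        raise ValueError("odd number of bytes; not whole 16-bit units")
    out = bytearray(len(data))
    out[0::2] = data[1::2]  # second byte of each pair moves to the front
    out[1::2] = data[0::2]  # first byte of each pair moves to the back
    return bytes(out)


# 0,65 0,66 (big-endian "AB") becomes 65,0 66,0 (little-endian "AB").
print(list(swap_pairs(bytes([0, 65, 0, 66]))))  # [65, 0, 66, 0]
```

Applying it twice gives back the original bytes, so the same helper serves before matching and again before writing out.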
More or less any application that reads Unicode should be able to cope
with UTF-8, so you should probably at least try that for output.
If it doesn't work, then 3&u: will convert your Unicode wchars to
numbers, which you can then split with 256&#.^:_1 before using the result
to select bytes from a. - they'll come out big-endian (high byte first),
so swap each pair if you need little-endian before writing out.
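The split into two bytes can be sketched as follows, in Python for illustration only. utf16_bytes is a made-up name, and the sketch assumes every character is in the Basic Multilingual Plane (code points above U+FFFF need surrogate pairs, which it does not attempt).

```python
def utf16_bytes(text: str, little_endian: bool = False) -> bytes:
    """Encode BMP-only text as UTF-16 by splitting each code point."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("outside the BMP; would need a surrogate pair")
        hi, lo = divmod(cp, 256)  # always two digits, high byte first
        out += bytes((lo, hi) if little_endian else (hi, lo))
    return bytes(out)


print(list(utf16_bytes("\ufeff")))                      # [254, 255]
print(list(utf16_bytes("\ufeff", little_endian=True)))  # [255, 254]
```

Note that divmod always yields exactly two digits, so plain ASCII text still gets its leading zero bytes.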
These two verbs should take a string of Unicode widechars and emit bytes
in UTF-16.
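NB. 256 256&#: always yields two base-256 digits (high byte first);
NB. 256&#.^:_1 yields only one when every code point is below 256.
utf16be =. [: , a. {~ 256 256 #: 3 u: ]
utf16le =. [: , [: |."1 a. {~ 256 256 #: 3 u: ]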
a. i. utf16le 4 u: 16bfeff
255 254
a. i. utf16be 4 u: 16bfeff
254 255
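Putting the pieces together for the original goal (match, transform, write back in the same encoding), here is a rough end-to-end sketch in Python. It is not rxmatches or PCRE, and several things are assumptions: the name transform_guids, the 8-4-4-4-12 hyphenated GUID layout, and the little-endian default for files without a BOM.

```python
import codecs
import re

# Assumed GUID layout: 8-4-4-4-12 hex digits separated by hyphens.
GUID = re.compile(r"[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}")


def transform_guids(data: bytes, fn, default: str = "utf-16-le") -> bytes:
    """Apply fn to every GUID in UTF-16 data, keeping BOM and byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):
        bom, enc = codecs.BOM_UTF16_LE, "utf-16-le"
    elif data.startswith(codecs.BOM_UTF16_BE):
        bom, enc = codecs.BOM_UTF16_BE, "utf-16-be"
    else:
        bom, enc = b"", default  # no BOM: caller supplies the byte order
    text = data[len(bom):].decode(enc)
    changed = GUID.sub(lambda m: fn(m.group(0)), text)
    return bom + changed.encode(enc)


sample = codecs.BOM_UTF16_LE + "id=6F9619FF-8B86-D011-B42D-00C04FC964FF;".encode("utf-16-le")
result = transform_guids(sample, str.lower)
print(result[2:].decode("utf-16-le"))  # id=6f9619ff-8b86-d011-b42d-00c04fc964ff;
```

The matching code only ever sees decoded text, so it stays ignorant of the on-disk encoding; the BOM, if any, is set aside and put back unchanged.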