Look at
http://www.jsoftware.com/jwiki/Scripts/Ufread

2007/4/27, Dan Bron <[EMAIL PROTECTED]>:

How can I use  rxmatches_jregex_  to match against Unicode?  I'm ignorant
(more or less) of Unicode and its various representations.

What I have is a bunch of files that appear to use two bytes to encode
every character.  The first byte is always 0.  One (only one) of the files
begins with the byte sequence  255 254  (which I take to be a BOM).

My goal is to use  rxmatches  on the files, transform the matches using a J
verb, and write the transformed results back (in whatever the native
encoding of these files is).  How can I do that?  Right now, if I
run  rxmatches  I get no results. If I run  rxmatches@:-.&({.a.)  I get the
expected results, but then the transformed results will be in ASCII, not
this  0,ASCII  pair thing.

Of course, I could use  rxmatches&.( -.&({.a.) :. ([: , ({.a.)&,.) )  but
that seems inelegant and if ever there's a character that
isn't  0,ASCII  it'll be wrong.  In fact, it'll mess up right off the bat,
on the BOM.

Also, I'm pretty sure PCRE supports Unicode natively (and efficiently), so
I'd like to leverage that, if I can, and make my code ignorant of the fact
that these files aren't ASCII (the parts I'm matching against are GUIDs and
so only contain the characters [0-9a-fA-F]).

So, could someone give me guidance on matching against Unicode?  I know
about  rxutf8  but it doesn't appear to help.  I'll be happy to use  u:  if
necessary, but the "obvious" approach,  &.(1&u:@:(6&u:))  doesn't work
because apparently  6&u:  is not invertible.

-Dan
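
One possible way to do the round trip described above, as a rough sketch
rather than a tested recipe: decode the two-byte text to UTF-8, run the
regex on the UTF-8 string, and encode the result back.  It assumes the files
are little-endian UTF-16 (low byte first, which is what a  255 254  BOM
suggests) with no characters outside the 16-bit range, and it uses the
modern  y  argument name.  The names  fromutf16le ,  toutf16le ,  guid ,
fix  and  convert  are made up for illustration;  toupper  is only a
stand-in for the real transforming verb.

```j
require 'regex'

NB. Assumption: input is little-endian UTF-16 bytes with an even byte count.
NB. Pair the bytes, put the high byte first, combine into code points,
NB. then 4 u: makes 2-byte chars and 8 u: encodes them as UTF-8.
fromutf16le =: 3 : '8 u: 4 u: 256 #. |."1 (_2 ]\ a. i. y)'

NB. Inverse direction: 7 u: decodes UTF-8, 3 u: gives code points,
NB. which are split into (high,low) bytes and emitted low byte first.
toutf16le =: 3 : 'a. {~ , |."1 (256 256 #: 3 u: 7 u: y)'

guid =: '[0-9a-fA-F]+'   NB. hypothetical pattern, not a full GUID pattern
fix  =: toupper          NB. hypothetical transform applied to each match

NB. Decode, transform every match, encode back to the original byte layout.
convert =: 3 : 'toutf16le guid fix rxapply fromutf16le y'
```

Because the BOM is decoded and re-encoded like any other character, it
should survive the round trip without special handling.  With the standard
file utilities loaded, something like  (convert fread f) fwrite f  would
then rewrite a file in place.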
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm




--
Björn Helgason, Engineer
Fugl&Fiskur ehf, Þerneyjarsund 23, Box 127
801 Grímsnes, e-mail: [EMAIL PROTECTED]
Skype: gosiminn, gsm: +3546985532
Landscaping and ornamental gardening, excavator services
http://groups.google.com/group/J-Programming


Technical skill handles the complex; creativity is the master of simplicity

A good teacher can step on toes without taking the shine off the shoes
         /|_      .-----------------------------------.
        ,'  .\  /  | With a light heart,               |
    ,--'    _,'   | today will be                     |
   /       /       | even better than yesterday        |
  (   -.  |        `-----------------------------------'
  |     ) |        (\_ _/)
 (`-.  '--.)       (='.'=)
  `. )----'        (")_(")