YES!  Sort-of, mostly.

First, I found some test files in UCS-2 encoding (which is almost the same
as UTF-16)
at http://www.humancomp.org/unichtm/unichtm.htm
(for anyone who wants to follow along at home but doesn't have UTF-16 files
sitting around).
(But alas, they are academic fair use of presumably copyrighted sources, so
we can't include them in the test suite (Debian requires CC, GPL, etc.);
so if we're going to support this usage, or any better usage later, we'll
need to find something with an explicit license or a PD/USG copyright
waiver.)
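(If you'd rather generate your own samples than download those, iconv can
transcode a UTF-8 file; a sketch, with throwaway filenames:)

```shell
# Make a small UTF-8 sample, then transcode it to the 2-byte encodings.
printf 'The wild wolf roams the wintry wastes.\n' > sample-utf8.txt
iconv -f UTF-8 -t UCS-2LE  sample-utf8.txt > sample-ucs2le.txt
iconv -f UTF-8 -t UTF-16LE sample-utf8.txt > sample-utf16le.txt
# Plain "UTF-16" (no LE/BE suffix) also prepends a BOM:
iconv -f UTF-8 -t UTF-16   sample-utf8.txt > sample-utf16bom.txt
```

(Each character becomes two bytes, so the UCS-2LE file is exactly twice the
size of the ASCII original, and the BOM adds two more.)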

So OK, yes, we can trick *ack* into processing UCS-2/UTF-16 files, even
without patching or making a feature request, but it's ugly: we inject a
global encoding declaration for *all* file opens:

$ perl -C '-Mopen IO=>":encoding(UCS-2LE)"' ~/bin/ack --noenv \
    'langues|wastes' russmnvr.htm tongtwst.htm unilang.htm

tongtwst.htm
164:    The wild wolf roams the wintry *wastes*.

unilang.htm
35:L'enseignement et l'étude des *langues*


*CAVEATS*

   - if your 'ack' is somewhere other than ~/bin/ack, use that path
   instead, obviously enough.
   - note the quotes; they're required for perl to get the modifiers for
   open as one argument. (Reverse the "" and '' pairing on Windows, I
   suspect.)
   - examples work with both Ack2 and (pre-release beta) Ack3 !
   - the *-C* is required for the UTF-8 output to STDOUT to be correct ...
   otherwise *étude* will have a blot instead of an e-accent.
   - I have to include the *--noenv* flag, because otherwise the universal
   UCS/UTF open option also applies to my *.ackrc* file, which of course is
   ASCII (or at most UTF-8, per locale), not UCS-2/UTF-16, and ack chokes
   on it.
   So if you have *-S --smart-case* etc. in your *.ackrc*, you'll need to
   repeat it (or -i) on your command line,
   along with any *--type-set* or *--pager* definitions there that are
   needed for the immediate search, *etc*.
   If you don't have a .ackrc, you might not need the --noenv. (But you
   should have one!)
   - The *IO=>* prefix above is actually optional,
   as -Mopen understands a bare :encoding to mean IO=>:encoding,
   but the long form documents the use of PerlIO, so it's preferred.
   (*When it only costs 4 characters ... I'll do it the long way*.)
   - If your files are true UTF-16 (meaning *with* surrogate pairs
   designating some 32-bit characters),
   it will tell you with
      UCS-2LE:no surrogates allowed
   in which case replace UCS-2LE above with *UTF-16LE*;
   and if big-endian (00 nn 00 nn) instead of (nn 00 nn 00), replace LE
   with *BE*.
   (The error will mention FFFE or FEFF.)
   - If files have bad codepoints, it is a fatal error, and ack will not
   move on to the next file;
   so you will need to filter out unclean UTF/UCS files if searching
   recursively or with glob wildcards
   (either by iterating until no fatal errors, or by filtering with a UTF
   validator earlier in the pipeline).
   (Why? The way we've universally injected UTF-ness via the open pragma
   <https://perldoc.perl.org/open.html> does not inject the permissive
   CHECK=>0 version of the Unicode en/decode, because that's not an option
   for JFDI Everywhere. :-/ )
   (If we decide to add a feature to handle UTF-16/UCS-2, we should be able
   to use the CHECK=>0 version to replace bad characters with blots?)
   (Example: one of the test files at the above page, the Zen bibliography
   ("Most of a large bibliography related to Zen Buddhism, containing works
   in Chinese, Japanese, and other languages: UCS-2
   <http://www.humancomp.org/unichtm/zenbibl.htm>, UTF-8
   <http://www.humancomp.org/unichtm/zenbibl8.htm>"), claims to be UCS-2
   but has bogus surrogate-pair values, and is rejected hard both as UCS-2
   (surrogate pairs not allowed) and as UTF-16 (bad HI surrogate). Firefox
   likes it just fine, though?!
   Sometimes we don't even get GIGO; it just chokes.)
   - the command-line search PCRE pattern is LATIN-1 (or ASCII, or your
   locale).
   I tried including a Greek string *|παπια|* from *Tongue Twisters in Many
   Languages: UCS-2 <http://www.humancomp.org/unichtm/tongtwst.htm>, UTF-8
   <http://www.humancomp.org/unichtm/tongtws8.htm>* in my search pattern,
   and it doesn't work, but I can search for English or French just fine. :-(
   (Alas, *unicode_start* in the terminal session didn't make this better
   either; I'm not sure where this gets mangled. May need to sort out
   Unicode patterns with UTF-8 first!)
   - I initially presumed I could just use *\N{NAME}* named characters to
   build a pattern matching *παπια* verbosely, a la \N{GREEK SMALL LETTER
   PI}, but no: for security, ack creates the regex from effectively a
   single-quoted string, so it never gets double-quote interpolation, and
   \N{NAME} and some others don't work.
   But I *can* use numeric Unicode character escapes to match a Unicode
   word of pi, alpha, iota only:
          \b[\N{U+03C0}\N{U+03B1}\N{U+03B9}]+\b
   (see below for an example using that; and obviously one could spell the
   word exactly that way too)
   - Even though the test files have the *BOM* (Byte Order Mark) at the
   start, I had to specify the byte order explicitly with the LE or BE (as
   well as UCS-2 vs UTF-16).
   If you guess wrong, it will report something like
   UTF-16BE:Unicode character fffe is illegal at /usr/bin/ack line 739.
   UCS-2LE:Unicode character feff is illegal at /usr/bin/ack line 739.
   in which case, switch BE<->LE and try again.
   (This makes me sad; I thought the BOM should disambiguate it for me. I
   guess that only works if I slurp first and then decode, which can't be
   injected and would require patching.)
   - The above means all the files in one pass of *ack* must be the same
   encoding *and* the same byte order.
   (You should be able to mix UTF-8 and ASCII/locale files.)
   (Workaround: filter a mixed collection into subsets first; consider
   using xargs with a list of each type if not sorting into
   folders/directories.)
   (If you don't have a Unicode validator to check what type each file is
   and whether it's clean, let me know and we'll think of something.)
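One way to dodge the LE/BE guessing game in the caveats above is to peek at
the first two bytes yourself. A shell sketch (the sample file here is
fabricated for illustration; substitute your real file):

```shell
# Write a tiny UCS-2LE sample with a BOM, then dump its first two bytes.
# FF FE means little-endian (use ...LE); FE FF means big-endian (...BE).
printf '\377\376h\000i\000' > bom-sample.htm
head -c 2 bom-sample.htm | od -A n -t x1
```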
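And for the mixed-collection caveat, file(1) can do rough triage. Its
wording varies between versions (hence the loose patterns), so treat this
as a sketch with fabricated sample files, not a robust validator:

```shell
# Fabricate one little-endian UTF-16 file (with BOM) and one ASCII file.
printf '\377\376h\000i\000\n\000' > sample-le.htm
printf 'plain ascii\n'            > sample-ascii.htm

# Partition *.htm by what file(1) reports, one list per encoding.
for f in *.htm; do
  case $(file -b "$f") in
    *UTF-16*[Ll]ittle-endian*|*[Ll]ittle-endian*UTF-16*) echo "$f" >> utf16le.list ;;
    *UTF-16*[Bb]ig-endian*|*[Bb]ig-endian*UTF-16*)       echo "$f" >> utf16be.list ;;
    *)                                                   echo "$f" >> other.list   ;;
  esac
done
```

Then run one ack pass per list, e.g.
xargs perl -C '-Mopen ":encoding(UTF-16LE)"' ~/bin/ack --noenv 'pattern' < utf16le.list
(assuming no whitespace in the filenames).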

Extra example -- Searching Greek tongue twisters :

$ perl -C '-Mopen ":encoding(UTF-16LE)"' ~/bin/ack --noenv \
    '(?x: langues | \b[\N{U+03C0}\N{U+03B1}\N{U+03B9}]+\b | wastes)' [a-y]*.htm
tongtwst.htm
50:    Μια *παπια* μα *πια* *παπια*?
164:    The wild wolf roams the wintry *wastes*.

unilang.htm
35:L'enseignement et l'étude des *langues*

The (?x: ) wrapper sets Expanded syntax, so the extra spaces aren't part of
the match; that makes the disjunction more readable.
(Oddly, I didn't find that the (?u:) wrapper, to force the pattern to be a
Unicode string, was required.)
(Why [a-y]*.htm? To avoid the zenbibl.htm file with the bad codepoint.)

*DISCUSSION*
Since we primarily position Ack as a programmer's code search / spelunking
tool, I'm not confident we'll accept a feature request to enable
UTF-16/UCS-2 and auto-detection via BOM of 8/16/32 and BE/LE ... but I'm
tempted to try anyway; no promises!

This may go into ack3's "Cookbook" documentation -- it's not quite a FAQ,
but it's worth recording, both to show what's possible and to show the
limits.

I hope this helps -- hit me with any follow-up questions this raises.

And thank you for a fun rabbit hole to go spelunking in !


// Bill

-- 
You received this message because you are subscribed to the Google Groups "ack 
users" group.