On Tue, Aug 28, 2018 at 8:44 AM David Cantrell <[email protected]>
wrote:
> On Fri, Aug 24, 2018 at 05:53:17PM -0400, Bill Ricker wrote:
>
> > *DISCUSSION*
> > Since we primarily position Ack as a programmer's code search /
> spelunking
> > tool, I'm not confident we'll accept a feature request to enable
> > UTF-16/UCS-2 and auto-detection via BOM of 8/16/32 and BE/LE ... but I'm
> > tempted to try anyway; no promises!
>
> FWIW while I do mostly use ack for grovelling over code, I also use it
> for grovelling over documentation and config data, both of which can
> contain non-ASCII code points.
Likewise! I collect a lot of out-of-copyright history/historic books as
PDF and txt, and export my .doc[x]/xls[x]/od[stp] files as txt, and then
use the tree of TXT as an index to all the PDFs etc via ack.
I am, however, aware that using Ack as part of my ad hoc CLI Document
Retrieval System is an "off-label" use of a code-search tool.
How often do programmers whose native language is not English use accented
characters in their variable/subroutine/Module::method names?
(We may not see much of it on CPAN, but it may be more prevalent in in-house
code?)
How often do we need to search for emoji that aren't expressed as hex
escapes or \N{NAME}, but appear as visible emoji characters in the source file?
(I wonder: is the shebang `#! perl` processed properly if it's preceded by
a UTF BOM? Is it locale-dependent?)
> Thankfully *I* only have to work (at
> least currently) with utf8 and ascii documentation and data, but I can
> imagine that there are poor souls out there working with other
> encodings.
>
UTF-16/32 will be rare outside of shops processing Asian languages, where
there are tradeoffs, but I can see how a word per code point would have the
same advantage (in the modern big-RAM world) that our former byte per code
point had in our innocent (= naive/guilty) days.
Running across a UTF file without a BOM, separated from any metadata
specifying which UTF it is, will be problematic.
> I'd love to see ack spawn something like an --encoding=utf-16 tentacle,
> with maybe an --encoding=automatic that could be stuffed into a .ackrc.
Inspired by the above thread, I ran an experiment, with hopes of
implementing the above suggestions.
Automatic encoding detection can work reliably only when a BOM (Byte Order
Mark) prefix is present at the front of the file.
Detection of encodings without a BOM is intrinsically heuristic at best.
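The reliable half, BOM sniffing, is mechanical. A minimal sketch in Perl
(the helper name `bom_encoding` is mine, not anything in ack):

```perl
use strict;
use warnings;

# Map a file's leading octets to an encoding name via its BOM.
# Returns undef when no BOM is present.
sub bom_encoding {
    my ($bytes) = @_;
    # Check the 4-byte UTF-32 BOMs before UTF-16: "\xFF\xFE" is a
    # prefix of the UTF-32LE BOM, so order matters.
    return 'UTF-32BE' if $bytes =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return;
}
```

Even here there is one genuine ambiguity: a UTF-16LE file whose first
character is NUL starts with the same four bytes as the UTF-32LE BOM.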
Binary data that may or may not be some UTF can't reliably be distinguished
by trying every encoding and catching conversion failures: there will be
false positives, which can only be accepted or rejected by recognizing
plausible content, and that is a natural-language problem. I can't tell
whether a series of Chinese (etc.) ideographs is nonsense or sensible, and
neither can Perl. (OTOH, Google Translate tells me the sequence of Chinese
ideographs emitted by a wrong UTF guess is nonsense.)
(Automatic BOM detection would even help with UTF-8 files currently being
detected as ASCII.)
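For the heuristic side, Perl already ships a try-the-suspects module,
Encode::Guess (part of the core Encode distribution), and usefully it
reports ambiguity instead of guessing when more than one suspect survives.
A minimal sketch:

```perl
use strict;
use warnings;
use Encode::Guess;   # bundled with the core Encode distribution

# guess_encoding() tries ascii, utf8, and BOM-marked UTF-16/32 by
# default, plus any extra suspects passed in. It returns an
# Encode::Encoding object on success, or an error *string* when the
# octets are ambiguous or match no suspect.
my $octets = "caf\xC3\xA9";            # "café" encoded as UTF-8
my $enc    = guess_encoding($octets);

if ( ref $enc ) {
    print 'guessed: ', $enc->name, "\n";
}
else {
    warn "could not decide: $enc\n";
}
```

Add suspects like euc-jp and shiftjis and the false-positive problem above
shows up immediately: many byte sequences are valid in several of them at once.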
I've added a feature request to the Ack3 RT queue for --encoding=utf32 and
for automagic interpretation of the BOM (with enable/disable flags).
But reading the document as Unicode opens a can of worms regarding Unicode
REs ... when is the RE to be treated as (?u:)? When does /[c]/ match "ç"?
When does /\w/ match "à á â ç è é ê ì í î ô ü µ 𝛷 𝛹 𝛳 ô 𝟇 𝝿 𝜎 τ"?
Not sure what the prognosis would be.
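To make those questions concrete, here is what stock Perl 5 regex semantics
do today (a sketch using core Perl only):

```perl
use strict;
use warnings;
use utf8;                              # this source contains non-ASCII literals
binmode STDOUT, ':encoding(UTF-8)';

# With Unicode semantics (/u, or any decoded string), \w does match
# accented letters:
print "café" =~ /caf\w$/u
    ? "accented \\w: match\n"          # é is \w under /u
    : "accented \\w: no match\n";

# A character class stays literal: [c] matches only U+0063, never
# U+00E7 ("ç"), because the regex engine does no normalization or
# accent-folding on its own.
print "ç" =~ /[c]/u
    ? "[c] vs ç: match\n"
    : "[c] vs ç: no match\n";          # prints "no match"
```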
--
You received this message because you are subscribed to the Google Groups "ack
users" group.
Visit this group at https://groups.google.com/group/ack-users.