Oh, wow, I forgot I even posted this. My current necessity for UTF-16 
searching is admittedly an unusual situation for a programmer: I'm 
receiving CSVs to process from various collaborators, all of whom are using 
Microsoft Excel on either Windows or OS X. Amazingly, it turns out no 
version of Excel on any platform can output a CSV in UTF-8: If there are 
any characters outside the ASCII range, Excel will encode in the given 
operating system's historical proprietary encoding (i.e., Windows-1252 or 
Mac OS Roman). This inevitably led to the collaborators corrupting their 
own files when passing them among themselves. My suggestion that everyone 
switch to LibreOffice was met with blank stares.

After some research, we discovered that Excel on all platforms can output 
tab-separated values encoded in UTF-16 (w/ BOM). LibreOffice Calc can do 
this, too: Save a file with the "Text CSV" format and tick the "Edit filter 
settings" checkbox to be presented with encoding and delimiter options. 
This solution was acceptable to everyone, including me, until the first 
time I tried to ack through one of the new UTF-16-encoded files.
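For anyone wondering why a byte-oriented search comes up empty on these files, here's a minimal sketch (Python rather than ack's Perl, purely for illustration): in UTF-16, every ASCII character becomes two bytes, so the raw byte pattern you'd grep for simply never occurs.

```python
# A UTF-16 file interleaves NUL bytes with the ASCII code points,
# so a byte-level search for an ASCII pattern finds nothing.
data = "item\tprix\tqualité\n".encode("utf-16")  # BOM + UTF-16-LE here

print(b"prix" in data)                  # False: bytes are p\x00r\x00i\x00x\x00
print("prix" in data.decode("utf-16"))  # True once the file is decoded
```

The file has to be decoded before matching; matching the encoded bytes directly can't work.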

On Tuesday, August 28, 2018 at 10:34:47 AM UTC-5, [email protected] wrote:
>
>
>
> On Tue, Aug 28, 2018 at 8:44 AM David Cantrell <[email protected]> wrote:
>
>> On Fri, Aug 24, 2018 at 05:53:17PM -0400, Bill Ricker wrote:
>>
>> > *DISCUSSION*
>> > Since we primarily position Ack as a programmer's code search / 
>> spelunking
>> > tool, I'm not confident we'll accept a feature request to enable
>> > UTF-16/UCS-2 and auto-detection via BOM of 8/16/32 and BE/LE ... but I'm
>> > tempted to try anyway; no promises!
>>
>> FWIW while I do mostly use ack for grovelling over code, I also use it
>> for grovelling over documentation and config data, both of which can
>> contain non-ASCII code points.
>
>
> Likewise !   I collect a lot of out-of-copyright history/historic books as 
> PDF and txt, and export my .doc[x]/xls[x]/od[stp] files as txt, and then 
> use the tree of TXT as an index to all the PDFs etc via ack.
>
> I am however aware that using Ack as part of my ad hoc CLI Document 
> Retrieval System is an *"off-label" use* of a code-search tool.
>
> How often do programmers whose native language is not English use accented 
> characters in their variable/subroutine/Module::method names?
> (We may not see much of it on CPAN, but it may be more prevalent in 
> in-house usage?)
> How often do we need to search for emoji that aren't expressed as hex or 
> \N{NAME} but are visible emoji characters in the source file?
>
> (I wonder, is the shebang `#! perl` processed properly if it's preceded by 
> a UTF BOM ?  Is it LOCALE dependent?)
>
>
>> Thankfully *I* only have to work (at
>> least currently) with utf8 and ascii documentation and data, but I can
>> imagine that there are poor souls out there working with other
>> encodings.
>>
>
> UTF-16/32 will be rare outside of shops processing Asian languages, where 
> there are tradeoffs, but I can see where WORD as Codepoint would have the 
> same advantage (in modern Big RAM world) that our former BYTE as Codepoint 
> had had in our innocent (=naive/guilty) days.
>
> Running across a UTF file without a BOM, separated from any metadata 
> specifying which UTF it is, will be problematic.
>
>> I'd love to see ack spawn something like an --encoding=utf-16 tentacle,
>> with maybe an --encoding=automatic that could be stuffed into a .ackrc.
>
>
> Inspired by the above thread, I ran an experiment, with hopes of implementing 
> above suggestions.
>
> Automatic encoding detection can work reliably only when a BOM (Byte Order 
> Mark) prefix is present at the front of the file.
> Detection of encodings without a BOM is intrinsically heuristic at best.
> Binary data that may or may not be some UTF can't reliably be identified by 
> trying every decoding and catching conversion failures: there will be false 
> positives, which can only be accepted or rejected by recognizing plausible 
> content, and that is a natural-language problem. I can't tell whether a 
> series of Chinese etc. ideographs is nonsense or sensible, and neither can 
> Perl. (OTOH, Google Translate tells me the sequence of Chinese ideographs 
> emitted by a wrong UTF guess is nonsense.)
>
> ( Automatic BOM detection would even help with UTF8 files currently being 
> detected as ASCII. )
>
> I've added a feature request to the Ack3 RT queue for --encoding=utf32 and 
> for automagic interpretation of BOM (with dis/en-able flags).
> But reading the document as Unicode opens a can of worms regarding Unicode 
> REs ... when is the RE to be treated as (?u:)? When does /[c]/ match "ç"? 
> When does /\w/ match "à á â ç è é ê ì í î ô ü µ 𝛷 𝛹 𝛳 ô 𝟇 𝝿 𝜎 τ" ?
> Not sure what the prognosis would be.
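Those questions are easy to poke at concretely. Here's an illustration using Python's `re` (Perl's semantics differ in detail, but the same decisions arise): `\w` matches an accented letter once the input is decoded, while a literal `c` matches "ç" only after Unicode normalization decomposes it.

```python
import re
import unicodedata

# \w is Unicode-aware on decoded text, so it matches the accented letter.
assert re.fullmatch(r"\w", "ç")

# A literal /c/ does NOT match the precomposed ç (U+00E7) ...
assert not re.search(r"c", "ç")

# ... but after NFD decomposition ç becomes "c" + U+0327 COMBINING CEDILLA,
# and a plain /c/ suddenly matches.
assert re.search(r"c", unicodedata.normalize("NFD", "ç"))
```

So the answer to "when does /[c]/ match ç" depends not just on regex flags but on which normalization form the file (or the tool) uses.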
>
>

-- 
You received this message because you are subscribed to the Google Groups "ack 
users" group.
Visit this group at https://groups.google.com/group/ack-users.