On Sun, Sep 2, 2018 at 3:48 AM Richard Simões <[email protected]> wrote:

> Oh, wow, I forgot I even posted this. My current necessity for UTF-16
> searching is admittedly an unusual situation for a programmer: I'm
> receiving CSVs to process from various collaborators, all of whom are using
> Microsoft Excel on either Windows or OS X. Amazingly, it turns out no
> version of Excel on any platform can output a CSV in UTF-8: If there are
> any characters outside the ASCII range, Excel will encode in the given
> operating system's historical proprietary encoding (i.e., Windows-1252 or
> Mac OS Roman). This inevitably led to the collaborators corrupting their
> own files when passing them among themselves.
>

Passing spreadsheets around is problematic in the best of times but wow,
that's special.
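
For anyone following along, here is a minimal sketch of how that corruption arises, assuming a file written by Excel on Windows gets opened by Excel on a Mac. The sample text is made up; "cp1252" and "mac_roman" are Python's standard codec names for the two encodings mentioned above.

```python
# Sketch only: hypothetical sample text, standard-library codecs.
original = "café"

# Excel on Windows writes the CSV bytes in Windows-1252.
windows_bytes = original.encode("cp1252")

# Excel on a Mac then interprets those same bytes as Mac OS Roman.
seen_on_mac = windows_bytes.decode("mac_roman")

# The ASCII letters survive; the accented letter comes out as something else.
print(repr(original), "->", repr(seen_on_mac))

# UTF-16 with a BOM, by contrast, decodes to the same text on either side.
round_tripped = original.encode("utf-16").decode("utf-16")
print(round_tripped == original)
```

Every re-save on the other platform repeats the damage, which is presumably how the files degraded as they were passed around.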

May I ask what sort of non-ASCII in the Excel was forcing encoding of the
CSV? The usual Latin-1 mix of multinational characters, emojis, or
non-Western scripts?

It did kinda make sense historically to default to the OS's proprietary
extended character set for a native app, even a multi-platform native app
like Excel, but wow, it doesn't have a UTF-8 option yet??? Adding UTF-16 as
an option does make sense for Asian-language file sharing; lucky it's there
for you!

> My suggestion that everyone switch to LibreOffice was met with blank stares.
>

Sigh. Yeah, folks with sunk costs, both in actual $$ and in training /
experience with MS Office, are resistant to the idea that they can get
better for less, since it challenges the validity of their prior decisions.
And they don't want to straddle two programs, one to collaborate with this
group and one to collaborate with everyone else.

For *some* use-cases, spreadsheets in the cloud can be better for shared
data than passing files around -- Office 365 or, better yet, Google Sheets.
Especially if it's data collection: Google Sheets has data entry forms.
(Not appropriate for highly sensitive information, even at the "only people
with URL can see" level, of course.)

> After some research, we discovered that Excel on all platforms can output
> tab-separated values encoded in UTF-16 (w/ BOM).
>

That is useful information!
( Let us be thankful it is with BOM! )
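
A rough sketch of why the BOM matters to the program on the receiving end, assuming the processing side is Python: the standard "utf-16" codec reads the BOM to pick the byte order and then drops it, so the same code handles either flavor. The function name and sample rows are hypothetical.

```python
import csv
import io


def read_utf16_tsv(raw):
    """Parse tab-separated rows from UTF-16 bytes that start with a BOM."""
    # The "utf-16" codec uses the BOM to choose the byte order, then strips it.
    text = raw.decode("utf-16")
    # newline="" leaves line endings alone so the csv module can handle them.
    return list(csv.reader(io.StringIO(text, newline=""), delimiter="\t"))


# Made-up stand-in for an Excel "Unicode Text" export.
sample = "name\tcity\r\nJosé\tMünchen\r\n".encode("utf-16")
for row in read_utf16_tsv(sample):
    print(row)
```

Without the BOM, the codec has to assume a byte order, and a wrong guess turns the whole file into garbage.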

> LibreOffice Calc can do this, too: Save a file with the "Text CSV" format
> and tick the "Edit filter settings" checkbox to be presented with encoding
> and delimiter options.
>

Cool!

> This solution was acceptable to everyone, including me,
>

This is somewhat surprising; I didn't expect to see UTF-16 be useful
outside of Asian text processing!

> until the first time I tried to ack through one of the new UTF-16-encoded
> files.
>

Since Ack is positioned as a programmers' code search tool, with data
search as a supplementary use and natural-language search as an "off label"
use (nice if it works, but not really supported), we've been rather casual
in our UTF-8 support. Yes, Perl and some other languages allow UTF-8
encoding of source files -- which allows non-ASCII hi-bit or multi-octet
chars in variable and subroutine names and in string constants in code
files, not just in data files/streams -- but it has mostly worked (sometimes
requiring a PERL_UNICODE=SAD environment prefix). This is the first time
we've run across a need to detect a BOM to determine file encoding.

UTF-16 is not in good odor on Linux for historical reasons. (I forget the
details.) Hence LOCALE=en_US.utf-16 is not an option to force Perl to deal
with it. So the simple layer of tricks we use to force UTF-8 onto Ack without
explicit decoding support won't work for UTF-16. (Although if one installed
an Asian UTF-32 locale on one's Ubuntu etc., they might work? Untried with
Ack!)

That you're searching DATA -- data that you may have programs processing, so
you are using Ack to peek into it while debugging (since we debug the
GIGO data as often as we debug code!) -- will add just a little weight to
the idea of detecting BOM prefixes and doing the right thing with them
(decoding to internal). BOM detection and processing *looks* simple to
_do_, but it is *not* simple to expand the test suite adequately to assure
it doesn't interact badly elsewhere.
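
To show the shape of the idea (in Python rather than Perl, and emphatically not Ack's code), here is a minimal sketch of BOM sniffing followed by a plain substring search. The function names, the UTF-8 fallback, and the sample data are all assumptions for illustration.

```python
import codecs

# Check longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
# Caveat: a UTF-16-LE file whose first character is U+0000 would be
# misread as UTF-32-LE by this table.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(raw, default="utf-8"):
    """Decode bytes using the encoding named by a leading BOM, if any."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    # No BOM: fall back to an assumed default encoding.
    return raw.decode(default)


def matching_lines(raw, needle):
    """Return the decoded lines that contain the search string."""
    return [line for line in decode_with_bom(raw).splitlines() if needle in line]


# Made-up UTF-16-LE data with a BOM, like the exported files in this thread.
data = codecs.BOM_UTF16_LE + "id\tname\n1\tSimões\n".encode("utf-16-le")
print(matching_lines(data, "Simões"))
```

Even this toy has an ambiguous case noted in the comments, which is the sort of thing the test suite would need to pin down.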
// Bill
