On Sun, Sep 2, 2018 at 3:48 AM Richard Simões <[email protected]> wrote:
> Oh, wow, I forgot I even posted this. My current necessity for UTF-16
> searching is admittedly an unusual situation for a programmer: I'm
> receiving CSVs to process from various collaborators, all of whom are using
> Microsoft Excel on either Windows or OS X. Amazingly, it turns out no
> version of Excel on any platform can output a CSV in UTF-8: If there are
> any characters outside the ASCII range, Excel will encode in the given
> operating system's historical proprietary encoding (i.e., Windows-1252 or
> Mac OS Roman). This inevitably led to the collaborators corrupting their
> own files when passing them among themselves.
>

Passing spreadsheets around is problematic at the best of times, but wow, that's special.

May I ask what sort of non-ASCII content in the spreadsheets was forcing the encoding of the CSV? The usual Latin-1 mix of multinational characters, emojis, or non-Western scripts?

It did kinda make sense historically to default to the OS's proprietary extended character set for a native app, even a multi-platform native app like Excel, but wow, it doesn't have a UTF-8 option yet??? Adding UTF-16 as an option does make sense for Asian-language file sharing; lucky it's there for you!

> My suggestion that everyone switch to LibreOffice was met with blank stares.
>

Sigh. Yeah, folks with sunk costs, both in actual $$ and in training / experience with MS Office, are resistant to the idea that they can get better for less, since it challenges the validity of their prior decisions. And they don't want to straddle two programs, one to collaborate with this group and one to collaborate with everyone else.

For *some* use cases, spreadsheets in the cloud can be better for shared data than passing files around -- Office 365 or, better yet, Google Sheets. Especially if it's data collection: Google Sheets has data entry forms. (Not appropriate for highly sensitive information, even at the "only people with the URL can see" level, of course.)
> After some research, we discovered that Excel on all platforms can output
> tab-separated values encoded in UTF-16 (w/ BOM).
>

That is useful information! (Let us be thankful it is with BOM!)

> LibreOffice Calc can do this, too: Save a file with the "Text CSV" format
> and tick the "Edit filter settings" checkbox to be presented with encoding
> and delimiter options.
>

Cool!

> This solution was acceptable to everyone, including me,
>

This is somewhat surprising; I didn't expect to see UTF-16 be useful outside of Asian text processing!

> until the first time I tried to ack through one of the new UTF-16-encoded
> files.
>

Since Ack is positioned as a programmers' code search tool, with data search as a supplementary use and natural-language search as an "off label" use (nice if it works, but not really supported), we've been rather casual in our UTF-8 support. Yes, Perl and some other languages allow UTF-8 encoding of source files -- which allows non-ASCII (high-bit or multi-octet) characters in variable and subroutine names and in string constants in code files, not just in data files/streams -- but it has mostly worked (sometimes requiring a PERL_UNICODE=SAD environment prefix).

This is the first time we've run across a need to detect a BOM tag to determine a file's encoding.

UTF-16 is not in good odor on Linux for historical reasons. (I forget the details.) Hence LOCALE=en_US.utf-16 is not an option to force Perl to deal with it. So the simple layer of tricks we use to force UTF-8 onto Ack without explicit decoding support won't work for UTF-16. (Although if one installed an Asian UTF-32 locale on one's Ubuntu etc., they might work? Untried with Ack!)

That you're searching DATA -- data that you may have programs processing, so that you're using Ack to peek into it while debugging (since we debug the GIGO data as often as we debug code!) -- will add just a little weight to the idea of detecting BOM prefixes and doing the right thing with them (decoding to the internal representation).
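To make "detecting BOM prefixes and decoding" concrete, here is a minimal sketch in Python (not Perl, and not anything Ack does today). The names `sniff_encoding` and `grep_lines` are hypothetical, and falling back to UTF-8 when no BOM is found is an assumption made for the example, not a recommendation.

```python
import codecs

# BOM signatures, 4-byte marks first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE), so order matters.
# Caveat: a UTF-16-LE file whose first character is U+0000 is
# indistinguishable from UTF-32-LE by this test.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(head, default="utf-8"):
    """Return a codec name for the BOM at the start of `head`, else `default`."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return default  # assumption: no BOM means UTF-8


def grep_lines(path, needle):
    """Yield (line_number, line) for each decoded line containing `needle`."""
    with open(path, "rb") as fh:
        encoding = sniff_encoding(fh.read(4))
    # These codecs use the BOM to pick the byte order and then drop it,
    # so the search runs on ordinary decoded text.
    with open(path, "r", encoding=encoding) as fh:
        for number, line in enumerate(fh, start=1):
            if needle in line:
                yield number, line.rstrip("\n")
```

Usage would be along the lines of `for n, line in grep_lines("export.txt", "Simões"): print(n, line)`, where `export.txt` stands in for one of the UTF-16 tab-separated exports.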
BOM detection and processing *looks* simple to _do_, but it is *not* simple to expand the test suite adequately to assure it doesn't interact badly elsewhere.

// Bill

--
You received this message because you are subscribed to the Google Groups "ack users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/ack-users.
For more options, visit https://groups.google.com/d/optout.
