Re: [sqlite] How to search for fields with accents in UTF-8 data?

Olivier Mascia Tue, 20 Jun 2017 07:18:10 -0700

> On 20 June 2017 at 15:24, R Smith <rsm...@rsweb.co.za> wrote:
> 
> As an aside - I never understood the reasons for that. I get that Windows has 
> a less "techy" clientèle than Linux for instance, and that the backwards 
> compatibility is paramount, and that no console command ever need fall 
> outside the 7-bit ANSI range of characters... but geez, how much effort can 
> it be to make it Unicode-friendly? It's not like the Windows API lacks any 
> Unicode functionality - even Notepad can handle it masterfully.


I don't want to look like I'm trolling on this subject, but this is only a 
matter of which I/O functions are used by programs built to interact with the 
display and keyboard when run in a console. Windows needs those programs to 
use ReadConsoleW/WriteConsoleW to do the proper thing.  Programs that use the 
C library to read or write byte streams can't do anything equivalent, no 
matter what 'codepage' is set or which DBCS the program attempts to convert 
to or from.

I learned this principle here last year and have had excellent success with 
console I/O in my programs ever since.

For completeness, regarding proper display of the output, there is a 
secondary consideration. The fonts available in Windows are far from covering 
a large subset of Unicode glyphs.  For eastern languages on a western Windows 
edition, you generally need to change your console settings to use a font 
other than the default one, just so that it can draw the needed glyphs.  But 
the basic thing to do is to get the program running in the console (here we 
are talking about shell.c - sqlite3.exe) to output Windows wide chars using 
the function WriteConsoleW(), and to use ReadConsoleW() to read chunks of 
wide chars from the console input before converting them internally to UTF-8 
or whatever is wanted.

The sqlite3 shell.c, when patched that way, is as pleasant to use in a Windows 
console as it is on a modern Linux or macOS.
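
As an illustration of the WriteConsoleW() approach described above, here is a 
minimal sketch. It is not the actual shell.c patch; the helper name 
write_utf8_stdout() is hypothetical. It assumes the program holds its text as 
UTF-8, converts it with MultiByteToWideChar() only when stdout really is a 
console, and otherwise (redirected output, or a non-Windows build) writes the 
UTF-8 bytes unchanged.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper (not part of sqlite3): write a NUL-terminated UTF-8
 * string to stdout. Returns 1 on success, 0 on failure. */
static int write_utf8_stdout(const char *utf8)
{
    size_t len = strlen(utf8);
    if (len == 0) return 1;
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode;
    /* GetConsoleMode() succeeds only when the handle is a real console,
     * not a file or a pipe. */
    if (out != NULL && out != INVALID_HANDLE_VALUE &&
        GetConsoleMode(out, &mode)) {
        /* First call computes the number of UTF-16 code units needed. */
        int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, (int)len, NULL, 0);
        if (wlen <= 0) return 0;
        wchar_t *wbuf = malloc((size_t)wlen * sizeof(wchar_t));
        if (wbuf == NULL) return 0;
        MultiByteToWideChar(CP_UTF8, 0, utf8, (int)len, wbuf, wlen);
        fflush(stdout); /* keep ordering with earlier stdio output */
        DWORD written = 0;
        BOOL ok = WriteConsoleW(out, wbuf, (DWORD)wlen, &written, NULL);
        free(wbuf);
        return ok ? 1 : 0;
    }
#endif
    /* Not a console (or not Windows): emit the UTF-8 bytes as they are. */
    return fwrite(utf8, 1, len, stdout) == len;
}
```

Reading would mirror this: ReadConsoleW() into a wide-char buffer, then 
WideCharToMultiByte() with CP_UTF8 to get back to UTF-8.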

Input files fed to sqlite3.exe need to be in UTF-8, and the output sent by 
sqlite3.exe will be UTF-8 as well: that part is perfectly OK today in 
sqlite3.exe. Only keyboard reading and console output writing are a little 
lacking.

> but geez, how much effort can it be to make it Unicode-friendly?

To comment on a more general level than sqlite3.exe, the issue lies deeper in 
Windows than in its console. Once upon a time (!), they chose a 16-bits-per-
character encoding as the *right* way (their right way!) to do Unicode. It 
took time for this to evolve and for the need for an encoding using multiple 
16-bit words (UTF-16) to be recognized. They could just as well have chosen 
UTF-8 from day one, but that is not how history went. Later, UTF-8 got *some* 
support in the OS (through conversion functions). But UTF-8 was never raised 
to full citizenship.  There is even a CHCP 65001 to set the 'codepage' to 
UTF-8. It works partly in some circumstances, but is far from being 'right'. 
No matter what you do, there is no way for any file I/O primitive of the OS 
to take a UTF-8 string as a filename. And this extends to the C library on 
the Windows platform. The only Unicode support is to pass a UTF-16 filename 
through functions ending with a W in the name. The 'ANSI' functions, ending 
with an A in the name, are merely wrappers converting to the wide-char 
versions.  There have been numerous requests to Microsoft to let people and 
developers set the ANSI codepage to UTF-8 so that the file I/O functions 
taking a narrow-char filename string can interpret it as UTF-8. Some are 
still waiting for that day to come; others use the W side of things, 
complicating portability of their codebase. :)
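
To make the "W side of things" concrete, here is a hedged sketch of the kind 
of wrapper a portable codebase ends up carrying. The name fopen_utf8() is 
hypothetical. It assumes filenames are kept as UTF-8 inside the program; on 
Windows it converts the name to UTF-16 and calls _wfopen(), and elsewhere it 
simply calls fopen().

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: open a file whose name is given as UTF-8.
 * 'mode' is a plain ASCII fopen() mode string such as "r" or "wb". */
static FILE *fopen_utf8(const char *utf8_name, const char *mode)
{
#ifdef _WIN32
    wchar_t wmode[16];
    /* Passing -1 as the length converts the terminating NUL as well. */
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, NULL, 0);
    if (wlen <= 0) return NULL;
    wchar_t *wname = malloc((size_t)wlen * sizeof(wchar_t));
    if (wname == NULL) return NULL;
    MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, wname, wlen);
    if (MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16) <= 0) {
        free(wname);
        return NULL;
    }
    FILE *f = _wfopen(wname, wmode); /* wide-char variant of fopen() */
    free(wname);
    return f;
#else
    /* Elsewhere the narrow-char API takes the UTF-8 bytes directly. */
    return fopen(utf8_name, mode);
#endif
}
```

Every call site that would have used fopen() then has to go through such a 
wrapper, which is exactly the portability cost mentioned above.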


-- 
Best Regards, Meilleures salutations, Met vriendelijke groeten,
Olivier Mascia, http://integral.software


_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
