Hello,

On 2024-11-12 4:59, António MARTINS-Tuválkin via Unicode wrote:
> I have been saying in the past couple decades that problems will vanish if all files include only “ASCII characters”, by means of NCR escape sequences, but some of the aforementioned individual editors seem unable to ensure it, so a wholesale “conversion” is the intermediate step that needs to be added to the workflow, before uploading.
I'm not sure NCRs were ever the best way to go (even decades ago): they are just numeric representations, not semantic ones like named HTML entities, so finding problems can become difficult.
In my opinion, you should remove all NCRs; otherwise it becomes a nightmare to check for wrong encodings (maybe some NCRs were written against Latin-1 and some against Unicode, and the frequent problem is double encoding). NCRs also make spell-checking difficult. Plain text is simpler to handle for people with every level of experience, and nowadays UTF-8 can be used with all tools.
So I would transform the text into UTF-8 without NCRs (the web now defaults to UTF-8).
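For example, a minimal Python sketch of that conversion step (the file names are hypothetical, and it only touches numeric references, so markup entities like &amp; and &lt; stay escaped):

import re
from pathlib import Path

# Replace only numeric character references (&#233; or &#xE9;) with the
# characters they encode; named entities such as &amp; and &lt; are left
# alone so the HTML markup itself is not broken.
NCR = re.compile(r"&#([xX][0-9a-fA-F]+|[0-9]+);")

def decode_ncr(match):
    ref = match.group(1)
    code = int(ref[1:], 16) if ref[0] in "xX" else int(ref)
    return chr(code)

src = Path("page.html")              # hypothetical input file
dst = Path("page.utf8.html")         # hypothetical output file
text = src.read_text(encoding="utf-8")
dst.write_text(NCR.sub(decode_ncr, text), encoding="utf-8")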
Then you can check which files have bad encoding, and where: a trial conversion to another UTF encoding should emit warnings on the invalid input, so just discard the output and look at the warnings.
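A small sketch of such a check in Python, assuming the pages live under a directory called "site" and end in .html (both are placeholders):

from pathlib import Path

# Try to decode every file strictly as UTF-8 and report the files (and the
# byte offsets) where decoding fails, which usually points at Latin-1 or
# double-encoded leftovers.
for path in Path("site").rglob("*.html"):
    data = path.read_bytes()
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        print(f"{path}: invalid UTF-8 at byte {err.start}: "
              f"{data[err.start:err.start + 4]!r}")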
In my experience, a given site has common patterns, so the NCRs and "bad characters" come in a limited number of types, and you can fix them with text substitution (sed on Linux and macOS, and I think various console tools on Windows support it too), or with other, more user-friendly tools (see below). I find it easy and quick. It is not a general solution, but as I said, a site often has common patterns, not many languages, and so on, so I usually go quick and dirty, which in this case is better than a perfect solution that can handle all characters.
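The substitution pass could also be done in Python instead of sed; in this sketch the replacement table only holds a few typical examples and would have to be filled with the patterns actually found on the site (it also assumes the files already decode as UTF-8, see the check above):

from pathlib import Path

# Small, site-specific table of "bad pattern" -> "correct text".
# The entries below are only illustrations of common cases.
FIXES = {
    "&#8217;": "’",   # an NCR the site happens to use a lot
    "Ã©": "é",        # é double-encoded and read back as Latin-1
    "â€™": "’",       # ’ double-encoded and read back as Windows-1252
}

for path in Path("site").rglob("*.html"):    # hypothetical tree
    text = path.read_text(encoding="utf-8")
    for bad, good in FIXES.items():
        text = text.replace(bad, good)
    path.write_text(text, encoding="utf-8")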
You may also want to consider programmers' or developers' tools: search and replace can usually be done across a whole tree of directories, with visual confirmation (e.g. jumping into the right file). Many of them also offer batch encoding conversion, so they are possibly the best tools for such a task, even if you do not program.
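If a script is preferred over an editor, a batch conversion over a whole tree could look roughly like this sketch, which assumes the remaining non-UTF-8 files are Latin-1 (adjust to the real source encoding of the site):

from pathlib import Path

for path in Path("site").rglob("*.html"):    # hypothetical tree
    data = path.read_bytes()
    try:
        data.decode("utf-8")                 # already valid UTF-8: leave it alone
    except UnicodeDecodeError:
        text = data.decode("latin-1")        # re-interpret and rewrite as UTF-8
        path.write_text(text, encoding="utf-8")
        print(f"converted {path}")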
giacomo
