the input:determine strategy will work fine for UTF-8 or SCSU, provided that the leading BOM is explicitly encoded. ...
By "determine" I do not mean only checking for a BOM. There are several ways to determine the input charset, depending on the protocol, the document type, and so on. They include, but are not limited to, a BOM, a protocol field, an in-document specification, and heuristics (guessing)...
About the BOM, or more precisely the Unicode signature byte sequences: Despite a theoretical ambiguity, it works quite well for discovering a Unicode charset, but unprepared and Unicode-unaware tools may choke on it.
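To illustrate signature detection, here is a minimal Python sketch. The function name `sniff_signature` and the table layout are made up for this example; the byte sequences are the standard Unicode signatures, plus 0E FE FF for SCSU. One ambiguity (possibly the one meant above) is that FF FE 00 00 can be read either as a UTF-32LE signature or as a UTF-16LE signature followed by U+0000. The sketch resolves it in favor of UTF-32LE by testing longer signatures first, which is an assumption and not a rule.

```python
import codecs

# Signature byte sequences, longest first, so that FF FE 00 00 is reported
# as UTF-32LE rather than as UTF-16LE followed by U+0000 (an assumption).
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (b"\x0e\xfe\xff", "SCSU"),          # 0E FE FF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def sniff_signature(data: bytes):
    """Return (charset_name, signature_length), or (None, 0) if no signature."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0
```

A caller would strip the reported number of bytes before decoding, and fall back to the other methods (protocol field, in-document declaration, guessing) when no signature is found.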
The idea that "if a text (without BOM) looks like valid UTF-8, then it is UTF-8; else it uses another legacy encoding" does not work in practice and also leads to too many false positives.
It may not work in all cases, but a heuristic that works in more than 95% or so of cases in practice works quite well, as far as I am concerned.
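For reference, the heuristic under discussion can be sketched in a few lines of Python. The name `guess_charset` and the choice of ISO-8859-1 as the fallback are assumptions for this example only; the sketch makes no claim about how often the guess is right.

```python
def guess_charset(data: bytes, fallback: str = "ISO-8859-1") -> str:
    """Guess the charset of text that carries no signature and no label."""
    try:
        data.decode("ascii")
        return "US-ASCII"  # pure 7-bit text, nothing to disambiguate
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict decoding rejects ill-formed sequences
        return "UTF-8"
    except UnicodeDecodeError:
        return fallback  # assumed legacy encoding; this part is a pure guess
```

Note that the two bytes C3 A9 are well-formed UTF-8 for "é" but are also valid ISO-8859-1 for "Ã©". The sketch reports UTF-8 for them, which is the kind of case in which the guess can be a false positive.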
- if you are absolutely certain that they suffice - use US-ASCII or ISO 8859-1.
OK for US-ASCII, but even ISO-8859-1 should no longer be used without explicit labelling of its encoding (with meta-data or by other means): ...
If possible, *all* text should have its charset specified in some way.
I just wonder why Unicode still maintains that a BOM _should_ not be used in UTF-8 texts.
I believe that "Unicode" does not say that. It is a concern among users of Unicode-unaware tools like classic Unix-y command-line tools that are slow to add good Unicode support. You are right that the signatures work quite well with more modern tools.
markus