Markus Scherer <markus....@gmail.com> wrote:
|Does "iconv -f utf8 -t latin1 < ${i} | iconv -f utf8 -t utf8" not work? It
|decodes one layer of UTF-8 and tests if the result is still in UTF-8, that
|seems right, and should work for all of Unicode.

It does work for

  ÄEIÖÜ ① 𐇐 𝄢 🀂 𐂂

but the error channel should possibly be suppressed all along the
way, as in

  FILE=some-file.txt
  (set +e;
  cat ${FILE} |
  iconv -f utf8 -t latin1 2>/dev/null |
  iconv -f utf8 -t utf8 >/dev/null 2>&1 &&
  echo It is likely that the file ${FILE} is encoded twice)

I mean, a nice plain little C tool would also be nice: one which
simply iterates over the data, checks for the two-octet sequences
that encoding UTF-8 into UTF-8 produces, checks that the decoded
result is valid UTF-8, too, and only replaces the original input
with the decoded output if the file contained at least one such
sequence.  (At least it would integrate better into my workflow
than some graphical JAVA ©® written by assembler-aware beauties :))

|markus

--steffen
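
P.S.: A minimal sketch of the check described above, in Python rather than C only to keep it short. The function name `undo_double_utf8` is made up for illustration, and the sketch assumes the inner layer was read as ISO-8859-1 (the same assumption the iconv pipeline makes), not as Windows-1252. Unlike the pipeline, it does not report a pure-ASCII file, which passes both iconv stages trivially.

```python
from typing import Optional


def undo_double_utf8(data: bytes) -> Optional[bytes]:
    """Return the once-decoded bytes if data looks doubly encoded, else None."""
    try:
        # Layer 1: the input must be valid UTF-8 whose code points all fit
        # into a single octet (U+0000..U+00FF), i.e. it converts to latin1.
        inner = data.decode("utf-8").encode("latin-1")
        # Layer 2: the recovered octets must themselves be valid UTF-8.
        inner.decode("utf-8")
    except UnicodeError:
        return None
    # Each two-octet sequence collapses to one octet, so the result differs
    # from the input exactly when at least one such sequence was present.
    return inner if inner != data else None


if __name__ == "__main__":
    import sys

    # Filter usage: write the repaired data, or pass the input through as-is.
    raw = sys.stdin.buffer.read()
    fixed = undo_double_utf8(raw)
    sys.stdout.buffer.write(raw if fixed is None else fixed)
```

A result of None means "leave the file alone". Anything else is only likely to be the intended content, since ordinary single-encoded text that consists solely of matching Latin-1 character pairs passes the same test.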