On 2018-05-29 19:47:37 +1000, Chris Angelico wrote:
> On Tue, May 29, 2018 at 6:34 PM, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> > On 2018-05-23 06:03:38 +0000, Steven D'Aprano wrote:
> >> Mojibake is especially difficult to deal with when you are dealing with
> >> short text snippets like file names or user names which can contain
> >> arbitrary characters, where there is rarely any way to recognise the
> >> "correct" string.
> >
> > For single file names or user names, sure. But if you have a list of
> > them, there is still a high probability that many of them will contain
> > recognizable words which can be used to deduce the (or a) correct
> > encoding. (Unless it's from the Ministry of Silly Names).
> 
> Ohh... are you assuming that, in a list of file names, all of them use
> the same encoding? Ah, yes, well, that WOULD make it easier, wouldn't
> it. Sadly, not the case.

Not in general, but it *IS* the case we were talking about here. The
task is to find *an* encoding which can be used to decode *a* file. This
of course assumes that such an encoding exists. If there are several
encodings in the same file (I use "file" loosely here), then there is no
single encoding which can be used to decode it, so the task is
impossible. (You may still be able to split the file into chunks, each of
which uses a single encoding, and determine the encoding of each chunk,
but that is a different task - and one for which the solution "ask the
source" is even less likely to work.)
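
To make the single-encoding case concrete, here is a minimal sketch in Python. It assumes every name in the list uses the same encoding, tries each candidate on the whole list, and keeps the one that yields the most recognizable words. The candidate tuple, the tiny word set, and the name guess_encoding are illustrative assumptions, not something from this thread; a real word list would be much larger.

```python
import re
from typing import Iterable, Optional

# Illustrative assumptions: a short candidate list and a tiny word list.
CANDIDATES = ("utf-8", "cp1252", "cp437")
KNOWN_WORDS = {"café", "müller", "straße"}


def guess_encoding(
    names: Iterable[bytes],
    candidates: Iterable[str] = CANDIDATES,
    known_words: Iterable[str] = KNOWN_WORDS,
) -> Optional[str]:
    """Return the candidate that decodes every name and recognizes the most words."""
    names = list(names)
    words = set(known_words)
    best, best_score = None, -1
    for enc in candidates:
        try:
            decoded = [name.decode(enc) for name in names]
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the whole list
        # Count the decoded words that appear in the word list.
        score = sum(
            1
            for text in decoded
            for word in re.findall(r"\w+", text.lower())
            if word in words
        )
        if score > best_score:  # ties keep the earlier candidate
            best, best_score = enc, score
    return best  # None if no candidate could decode every name


raw_names = [s.encode("cp1252") for s in ("café.txt", "Müller.doc", "Straße.pdf")]
print(guess_encoding(raw_names))
```

Several encodings can decode the same bytes to the same text (Latin-1 and Latin-2 agree on the three names above, for instance), so this finds *a* usable encoding rather than *the* encoding. If the list mixes encodings, a candidate either fails on some name or wins with a low score, which is the impossible case described above.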

        hp

-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | h...@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>
