On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman
> <v+pyt...@g.nevcal.com> wrote:
>> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
>>> What encoding does have a text file (an HTML, to be precise) with
>>> text in utf-8, ads in cp1251 (ad blocks were included from different
>>> files) and comments in koi8-r?
>>> Well, I must admit the HTML was rather an exception, but having a
>>> text file with some strange characters (binary strings, or paragraphs
>>> in different encodings) is not that exceptional.
>> That's not a text file. That's a binary file containing (hopefully
>> delimited, and documented) sections of encoded text in different
>> encodings.
> Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
> Oleg.
I was not saying that your file fails to be a "text file" under any
definition taken from the Python3 documentation, only under a
common-sense definition of "text file".
Looking at it from Python3, though, it is clear that when a file is
opened in "text" mode, an encoding may be specified or will be assumed.
That is one encoding, applying to the whole file, not 3 encodings with
declarations on when to switch between them. So I think, in general,
Python3 assumes a definition of text file that matches my "common
sense" definition. Also, if it is an HTML file, I doubt the browser
will use multiple encodings when interpreting it. A file that contains
text in several encodings but is served with a single one is therefore
of doubtful practical use for its intended purpose, unless javascript
or some other programming in the browser reencodes the data.
On the other hand, Python3 provides various facilities for working with
such files.
The first I'll mention is the one that follows from my description of
what your file really is: Python3 allows opening a file in binary mode
and then decoding each section of it with whatever encoding you like,
using the bytes.decode() operation. Determining which sections are in
which encodings is beyond the scope of this description of the
technique, and is application dependent.
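
To make the first technique concrete, here is a minimal sketch. It
assumes the section boundaries are already known as byte offsets; the
sample text, the layout list, and the helper names (write_mixed,
read_mixed, demo) are all made up for illustration, and a real file
would need its own way of finding the boundaries.

```python
import os
import tempfile

# Hypothetical sample data: three sections, each in its own encoding.
SECTIONS = [
    ("utf-8", "Привет, мир\n"),
    ("cp1251", "Реклама\n"),
    ("koi8-r", "Комментарий\n"),
]


def write_mixed(path):
    """Write the sample sections; return (encoding, start, end) byte offsets."""
    layout = []
    offset = 0
    with open(path, "wb") as f:
        for encoding, text in SECTIONS:
            data = text.encode(encoding)
            f.write(data)
            layout.append((encoding, offset, offset + len(data)))
            offset += len(data)
    return layout


def read_mixed(path, layout):
    """Open in binary mode and decode each section with its own encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    return [raw[start:end].decode(encoding) for encoding, start, end in layout]


def demo():
    """Round-trip the sample data through a temporary file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        layout = write_mixed(path)
        return read_mixed(path, layout)
    finally:
        os.remove(path)
```

Calling demo() should give back the three original strings. The hard
part, working out the layout for a file you did not write yourself, is
exactly what the sketch leaves out.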
The second is to specify an error handler that, like you, is trained to
recognize the other encodings and convert them appropriately. I'm not
aware that such an error handler has been or could be written, myself
not having your training.
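
For what it is worth, the hook for plugging in a handler does exist:
codecs.register_error(). The sketch below is deliberately crude and is
not the "trained" handler described above. It assumes that every byte
sequence that is invalid as UTF-8 is cp1251, and it makes no attempt to
tell cp1251 from koi8-r. The handler name "cp1251_fallback" and the
sample data are invented for the example. Note also that some cp1251
byte pairs happen to be valid UTF-8 and would never reach the handler.

```python
import codecs


def cp1251_fallback(exc):
    """Decode bytes that are invalid UTF-8 as cp1251 (a blind assumption)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume.
    return bad.decode("cp1251", errors="replace"), exc.end


# Register the handler under a hypothetical name.
codecs.register_error("cp1251_fallback", cp1251_fallback)

# Sample data: a UTF-8 prefix followed by a cp1251-encoded word.
data = "utf-8: Привет; ".encode("utf-8") + "cp1251: Реклама".encode("cp1251")
text = data.decode("utf-8", errors="cp1251_fallback")
```

The same handler name can be passed as the errors argument to open() in
text mode.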
The third is to specify UTF-8 with the surrogateescape error handler.
This allows non-UTF-8 bytes to be loaded into memory. You, or
algorithms as smart as you that could perhaps be developed, could
detect and manipulate the resulting "lone surrogate" codes in
meaningful ways, or could simply allow them to ride along without
interpretation and be emitted as the original bytes into other files.
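
A small sketch of the third technique, using made-up sample bytes: the
undecodable bytes come in as lone surrogates in the range U+DC80 to
U+DCFF, ride along in the str, and go back out unchanged when encoded
with the same handler. The last step, decoding the escaped run as
cp1251, again assumes you already know which encoding it was.

```python
# Sample data: valid UTF-8 followed by a cp1251-encoded word.
raw = "текст ".encode("utf-8") + "реклама".encode("cp1251")

# Invalid bytes are smuggled into the str as lone surrogates.
text = raw.decode("utf-8", errors="surrogateescape")

# Collect the lone surrogates that stand in for the undecodable bytes.
lone = [ch for ch in text if "\udc80" <= ch <= "\udcff"]

# Encoding with the same handler reproduces the original bytes exactly.
round_trip = text.encode("utf-8", errors="surrogateescape")

# If the real encoding is known, the escaped run can be recovered and decoded.
tail = "".join(lone).encode("utf-8", errors="surrogateescape").decode("cp1251")
```

open() accepts the same errors="surrogateescape" argument in text mode,
for both reading and writing.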
There may be other techniques that I am not aware of.
Glenn