On 8/22/2014 9:52 AM, Oleg Broytman wrote:
On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman 
<v+pyt...@g.nevcal.com> wrote:
On 8/22/2014 8:51 AM, Oleg Broytman wrote:
    What encoding does a text file (an HTML file, to be precise) have
when it contains text in utf-8, ads in cp1251 (the ad blocks were
included from different files) and comments in koi8-r?
    Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.
That's not a text file. That's a binary file containing (hopefully
delimited, and documented) sections of encoded text in different
encodings.
    Allow me to disagree. For me, this is a text file which I can (and
do) view with a pager, edit with a text editor, list on a console,
search with grep and so on. If it is not a text file by strict Python3
standards then these standards are too strict for me. Either I find a
simple workaround in Python3 to work with such texts or find a different
tool. I cannot avoid such files because my reality is much more complex
than the strict text/binary dichotomy in Python3.

Oleg.

I was not declaring that your file is not a "text file" by any definition obtained from the Python3 documentation, just by a common sense definition of "text file".

Looking at it from Python3, though, it is clear that when a file is opened in "text" mode, an encoding may be specified or will be assumed. That is one encoding, applying to the whole file, not three encodings with declarations of when to switch between them. So I think, in general, Python3 assumes a definition of "text file" that matches my "common sense" definition.

Also, if it is an HTML file, I doubt the browser will use multiple different encodings when interpreting it. So it is not clear that a file containing text in several encodings, but served using only a single encoding, is of practical use for its intended purpose, unless there is JavaScript or some other programming in the browser that re-encodes the data.

On the other hand, Python3 provides various facilities for working with such files.

The first I'll mention is the one that follows from my description of what your file really is: Python3 allows opening a file in binary mode and then decoding each section of it with whatever encoding you like, by calling bytes.decode() on that section. Determining which sections are in which encodings is beyond the scope of this description of the technique, and is application dependent.
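A minimal sketch of that first technique follows. It assumes a made-up layout in which each line of the file is one section and the encoding of each section is already known; the helper name decode_sections and the sample data are hypothetical, and a real file would need its own way of finding section boundaries.

```python
import os
import tempfile


def decode_sections(path, encodings):
    """Read `path` in binary mode and decode line i with encodings[i].

    Assumption for this sketch: sections are separated by b"\n" and
    their encodings are known in advance.
    """
    with open(path, "rb") as f:
        raw_sections = f.read().split(b"\n")
    return [raw.decode(enc) for raw, enc in zip(raw_sections, encodings)]


# Build a small mixed-encoding sample file (hypothetical data).
parts = [("текст", "utf-8"), ("реклама", "cp1251"), ("комментарий", "koi8-r")]
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\n".join(text.encode(enc) for text, enc in parts))

decoded = decode_sections(tmp.name, [enc for _, enc in parts])
os.remove(tmp.name)
print(decoded)
```

Splitting on b"\n" is safe for this sample because none of the three encodings uses byte 0x0A inside a Cyrillic character; other delimiters or encodings would need separate checking.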

The second is to specify an error handler that, like you, is trained to recognize the other encodings and convert them appropriately. I'm not aware that such an error handler has been written, or could be, myself not having your training.
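The mechanism for plugging in such a handler does exist: codecs.register_error() accepts a callable that receives the decoding exception and returns replacement text plus the position at which to resume. The sketch below is deliberately naive and is not the "trained" handler described above. It assumes, purely for illustration, that every byte run that is invalid as UTF-8 is cp1251; the handler name "cp1251_fallback" is made up.

```python
import codecs


def cp1251_fallback(exc):
    """Decode bytes that are invalid as UTF-8 as cp1251 instead.

    Assumption for this sketch: anything that is not valid UTF-8 is
    cp1251. A real handler would have to guess the encoding itself.
    """
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # errors="replace" covers the one byte (0x98) that cp1251 leaves undefined.
    return bad.decode("cp1251", errors="replace"), exc.end


codecs.register_error("cp1251_fallback", cp1251_fallback)

# Hypothetical sample: UTF-8 text followed by cp1251 text.
raw = "текст ".encode("utf-8") + "реклама".encode("cp1251")
result = raw.decode("utf-8", errors="cp1251_fallback")
print(result)
```

The limits are easy to see: the handler cannot tell cp1251 from koi8-r, and it is never called for foreign bytes that happen to form valid UTF-8 sequences.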

The third is to specify UTF-8 with the surrogateescape error handler. This allows bytes that are not valid UTF-8 to be loaded into memory. You, or algorithms as smart as you, could perhaps detect and manipulate the resulting "lone surrogate" codes in meaningful ways, or could simply let them ride along without interpretation and be emitted as the original bytes into other files.
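A short sketch of the surrogateescape round trip, using hypothetical in-memory sample data rather than a real file. The same errors="surrogateescape" argument can also be passed to open() in text mode.

```python
# Hypothetical sample: UTF-8 text followed by cp1251 text.
raw = "текст ".encode("utf-8") + "реклама".encode("cp1251")

# Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
lone = [ch for ch in text if "\udc80" <= ch <= "\udcff"]

# Letting them ride along: encoding with the same handler restores
# the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Interpreting them: turn the escaped run back into bytes, then decode
# it with the encoding you believe it to be in (cp1251 is assumed here).
recovered = "".join(lone).encode("utf-8", errors="surrogateescape").decode("cp1251")
print(len(lone), round_tripped == raw, recovered)
```

Encoding such a string without the surrogateescape handler raises UnicodeEncodeError, so the handler has to be specified on output as well as on input.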

There may be other techniques that I am not aware of.

Glenn
