Re: [Python-Dev] Bytes path support

Glenn Linderman Fri, 22 Aug 2014 10:11:40 -0700

On 8/22/2014 9:52 AM, Oleg Broytman wrote:

On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote:

On 8/22/2014 8:51 AM, Oleg Broytman wrote:

    What encoding does a text file (an HTML file, to be precise) have, with
text in utf-8, ads in cp1251 (ad blocks were included from different
files) and comments in koi8-r?
    Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.

That's not a text file. That's a binary file containing (hopefully
delimited, and documented) sections of encoded text in different
encodings.

    Allow me to disagree. For me, this is a text file which I can (and
do) view with a pager, edit with a text editor, list on a console,
search with grep and so on. If it is not a text file by strict Python3
standards then these standards are too strict for me. Either I find a
simple workaround in Python3 to work with such texts or find a different
tool. I cannot avoid such files because my reality is much more complex
than strict text/binary dichotomy in Python3.


Oleg.

I was not declaring your file not to be a "text file" from any definition obtained from Python3 documentation, just from a common sense definition of "text file".

Looking at it from Python3, though, it is clear that when opening a file in "text" mode, an encoding may be specified or will be assumed. That is one encoding, applying to the whole file, not 3 encodings, with declarations on when to switch between them. So I think, in general, Python3 assumes a definition of text file that matches my "common sense" definition.

Also, if it is an HTML file, I doubt the browser will use multiple different encodings when interpreting it. So it is not clear that the file is of practical use for its intended purpose if it contains text in multiple different encodings but is served using only a single encoding, unless there is javascript or some programming in the browser that reencodes the data.

On the other hand, Python3 provides various facilities for working with such files.

The first I'll mention is the one that follows from my description of what your file really is: Python3 allows opening files in binary mode, and then decoding each section with whatever encoding you like, using the bytes.decode() operation. Determining which sections are in which encodings is beyond the scope of this description of the technique, and is application dependent.
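As a rough sketch of this first technique (not anything from Oleg's actual files): the sample text, the temporary file, and the "sections" table of byte offsets below are all made up for illustration. How a real application learns where each section starts and ends is exactly the application-dependent part.

```python
import os
import tempfile

# Sample data: three sections, each stored in a different encoding.
parts = [
    ("Основной текст\n", "utf-8"),
    ("Реклама\n", "cp1251"),
    ("Комментарий\n", "koi8-r"),
]
encoded = [text.encode(enc) for text, enc in parts]

# Hypothetical section map: (start offset, end offset, encoding).
# A real application would have to obtain this from delimiters,
# metadata, or its own knowledge of how the file was assembled.
sections = []
offset = 0
for chunk, (_, enc) in zip(encoded, parts):
    sections.append((offset, offset + len(chunk), enc))
    offset += len(chunk)

# Write the mixed-encoding bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"".join(encoded))
    path = f.name

try:
    # Open in binary mode: no decoding happens on read.
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

# Decode each section with its own encoding via bytes.decode().
decoded = "".join(raw[start:end].decode(enc) for start, end, enc in sections)
print(decoded)
```

The point is only that the file is read as bytes and each slice is decoded separately; nothing here detects the encodings.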

The second is to specify an error handler, that, like you, is trained to recognize the other encodings and convert them appropriately. I'm not aware that such an error handler has been or could be written, myself not having your training.
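For what it is worth, the mechanism for plugging in a custom handler does exist: codecs.register_error(). The sketch below is far less clever than the handler described above. It does not recognize anything; it simply assumes that every byte UTF-8 rejects is cp1251. The handler name "cp1251_fallback" and the sample data are invented for the example. Note also that some cp1251 byte sequences happen to be valid UTF-8 and would never reach the handler, so this is a heuristic at best.

```python
import codecs


def cp1251_fallback(exc):
    """Hypothetical handler: reinterpret undecodable bytes as cp1251."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # errors="replace" guards against bytes that cp1251 leaves undefined.
    # Return the replacement text and the position at which to resume.
    return bad.decode("cp1251", errors="replace"), exc.end


# Register the handler under a made-up name.
codecs.register_error("cp1251_fallback", cp1251_fallback)

# UTF-8 text followed by a cp1251 fragment.
raw = "текст ".encode("utf-8") + "реклама".encode("cp1251")
text = raw.decode("utf-8", errors="cp1251_fallback")
print(text)
```

The same handler name can be passed as the errors argument to open() in text mode.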

The third is to specify UTF-8 with the surrogateescape error handler. This allows non-UTF-8 bytes to be loaded into memory. You, or algorithms as smart as you, could perhaps detect and manipulate the resulting "lone surrogate" codes in meaningful ways, or could simply allow them to ride along without interpretation, and be emitted as the original bytes into other files.
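A small sketch of this third technique, with made-up sample data. Each byte that is not valid UTF-8 becomes a lone surrogate in the range U+DC80..U+DCFF, and encoding with the same error handler restores the original bytes.

```python
# UTF-8 text followed by a fragment in cp1251 (sample data only).
raw = "текст ".encode("utf-8") + "реклама".encode("cp1251")

# Undecodable bytes are smuggled through as lone surrogates.
text = raw.decode("utf-8", errors="surrogateescape")

# Detect the escaped bytes: surrogateescape maps byte 0xNN to U+DCNN.
lone = [ch for ch in text if "\udc80" <= ch <= "\udcff"]

# One possible "meaningful manipulation": recover the original bytes
# and decode them with an encoding chosen by the application.
recovered = bytes(ord(ch) - 0xDC00 for ch in lone).decode("cp1251")

# Or let them ride along untouched and emit the original bytes again.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(len(lone), recovered, round_tripped == raw)
```

The same errors="surrogateescape" argument works with open() in text mode, for both reading and writing.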


There may be other techniques that I am not aware of.

Glenn

