On Fri, Aug 22, 2014 at 10:09 AM, Glenn Linderman <v+pyt...@g.nevcal.com> wrote:
> What encoding does have a text file (an HTML, to be precise) with
> text in utf-8, ads in cp1251 (ad blocks were included from different
> files) and comments in koi8-r?
> Well, I must admit the HTML was rather an exception, but having a
> text file with some strange characters (binary strings, or paragraphs
> in different encodings) is not that exceptional.
>
> That's not a text file. That's a binary file containing (hopefully
> delimited, and documented) sections of encoded text in different
> encodings.
>
> Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
>

First -- we're getting OT here -- this thread was about file and path
names, not the contents of files. But I suppose I brought that in when I
talked about writing file names to files...

> The first I'll mention is the one that follows from my description of what
> your file really is: Python3 allows opening files in binary mode, and then
> decoding various sections of it using whatever encoding you like, using the
> bytes.decode() operation on various sections of the file. Determination of
> which sections are in which encodings is beyond the scope of this
> description of the technique, and is application dependent.
>

Right -- and you would have wanted to open such a file in binary mode with
py2 as well, but in that case you'd have the contents in a py2 string
object, which has a few more convenient ways to work with text (at least
ascii-compatible) than the py3 bytes object does.

> The third is to specify the UTF-8 with the surrogate escape error handler.
> This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as
> smart as you, could perhaps be developed to detect and manipulate the
> resulting "lone surrogate" codes in meaningful ways, or could simply allow
> them to ride along without interpretation, and be emitted as the original,
> into other files.
>

Just so I'm clear here -- if you write that back out, encoded as utf-8 --
you'll get the exact same binary blob out as came in?

I wonder if this would make it hard to preserve byte boundaries, though.

By the way, IIUC, you can also use the python latin-1 decoder -- anything
that really is latin-1 will come through correctly, anything else will come
in as garbage, but if you re-encode with latin-1 the original bytes will be
preserved. I think this will also preserve a 1:1 relationship between
character count and byte count, which could be handy.

-Chris

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
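For anyone who wants to check the round-trip questions above, here is a minimal Python 3 sketch of the three techniques mentioned in this message: decoding sections separately, UTF-8 with surrogateescape, and latin-1. The sample bytes and the split offset are made up for illustration; in a real file, finding the section boundaries is application dependent, as noted above.

```python
# Hypothetical sample: a UTF-8 section followed by a koi8-r section.
utf8_part = "caf\u00e9 ".encode("utf-8")                          # 6 bytes
koi8_part = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")  # 6 bytes
raw = utf8_part + koi8_part

# 1. Binary mode plus per-section decoding. The offset is assumed to be
#    known here; discovering it is the application's job.
split = len(utf8_part)
sections = [raw[:split].decode("utf-8"), raw[split:].decode("koi8-r")]

# 2. UTF-8 with surrogateescape: undecodable bytes become lone surrogates
#    (U+DC80..U+DCFF) and are turned back into the original bytes only if
#    the same error handler is used when encoding.
text = raw.decode("utf-8", errors="surrogateescape")
roundtrip_escape = text.encode("utf-8", errors="surrogateescape")

# Encoding with the default strict handler refuses the lone surrogates.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 3. latin-1: every byte maps to the code point with the same value, so the
#    round trip is lossless and character count equals byte count.
latin = raw.decode("latin-1")
roundtrip_latin = latin.encode("latin-1")

print(roundtrip_escape == raw, roundtrip_latin == raw)
print(len(raw), len(text), len(latin))
```

In this sketch the surrogateescape string is shorter than the byte string, because the two-byte UTF-8 sequence for "é" decodes to one character, while the latin-1 string has exactly one character per byte.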