Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
On Fri, May 29, 2015 at 4:44 AM, Jon Ribbens wrote:
> On 2015-05-29, Ian Kelly wrote:
>> On Fri, May 29, 2015 at 2:05 AM, anatoly techtonik wrote:
>>> Added Mailman to my suxx tracker:
>>> https://github.com/techtonik/suxx-tracker#mailman
>>
>> What a useless tool. Instead of tiredly complaining that things suck,
>> why not take some initiative to make them better?
>>
>> I'm curious about your complaint about virtualenv. How do you envision
>> that "logging in" to the env would be any different from activating
>> it?
>
> Please Do Not Feed The Troll.

It's not a troll if the discussion is potentially useful and not just
disruptive.

--
https://mail.python.org/mailman/listinfo/python-list
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
On Fri, May 29, 2015 at 2:39 PM, Laura Creighton wrote:
> Do you know about the codecs module?
>
> reading http://pymotw.com/2/codecs/ may be useful if this is new to you.

Does that work for Python 2 and Python 3?

> Have you read https://www.python.org/dev/peps/pep-0293/ ?

No.

> Will backslashreplace do what you want?

I don't know. I am sorry, but where is the code that does this:

    binary -> escaped utf-8 string -> unicode -> binary

I know about the codecs module, but I am not seeing a solution for
crash-proof output from Python. Is inserting a custom codec class into
every piece of code that I want to debug the only solution?

--
anatoly t.
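For reference, the binary -> escaped text -> binary round trip asked about here does not seem to need a custom codec class: a decode error handler registered once through codecs.register_error (the PEP 293 mechanism) may be enough. The sketch below is an illustration and was not posted in the thread. The handler name "backslash_undecodable" and the helpers escape_bytes and unescape_text are hypothetical, UTF-8 is assumed to be the preferred interpretation, and the code is intended to run on both Python 2.7 and Python 3.

```python
import codecs
import re


def _backslash_undecodable(exc):
    """PEP 293 error handler: turn each undecodable byte into \\xNN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = bytearray(exc.object[exc.start:exc.end])
    return u"".join(u"\\x%02x" % byte for byte in bad), exc.end


# Hypothetical handler name; it only needs to be registered once per process.
codecs.register_error("backslash_undecodable", _backslash_undecodable)


def escape_bytes(raw):
    """bytes -> unicode text; valid UTF-8 stays readable, the rest is escaped."""
    # Double literal backslashes first so they cannot be mistaken for
    # escapes later. Byte 0x5c never occurs inside a multi-byte UTF-8
    # sequence, so doing this before decoding is safe.
    doubled = raw.replace(b"\\", b"\\\\")
    return doubled.decode("utf-8", "backslash_undecodable")


_ESCAPE = re.compile(br"\\(\\|x([0-9a-f]{2}))")


def _restore(match):
    if match.group(2) is None:
        return b"\\"  # doubled backslash -> single literal backslash
    return bytes(bytearray([int(match.group(2), 16)]))


def unescape_text(text):
    """Inverse of escape_bytes: unicode text -> the original bytes."""
    return _ESCAPE.sub(_restore, text.encode("utf-8"))
```

With this, escape_bytes(b"ab\xff") gives the text ab\xff, and unescape_text returns the original bytes. Doubling the backslashes is what makes the mapping reversible; without it, a Node whose name already contains a literal \xff could not be told apart from an escaped byte.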
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
Do you know about the codecs module?

Reading http://pymotw.com/2/codecs/ may be useful if this is new to you.

Have you read https://www.python.org/dev/peps/pep-0293/ ?

Will backslashreplace do what you want?

Laura
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
On 2015-05-29, Ian Kelly wrote:
> On Fri, May 29, 2015 at 2:05 AM, anatoly techtonik wrote:
>> Added Mailman to my suxx tracker:
>> https://github.com/techtonik/suxx-tracker#mailman
>
> What a useless tool. Instead of tiredly complaining that things suck,
> why not take some initiative to make them better?
>
> I'm curious about your complaint about virtualenv. How do you envision
> that "logging in" to the env would be any different from activating
> it?

Please Do Not Feed The Troll.
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
On 29/05/2015 11:02, Ian Kelly wrote:
> On Fri, May 29, 2015 at 2:05 AM, anatoly techtonik wrote:
>> Added Mailman to my suxx tracker:
>> https://github.com/techtonik/suxx-tracker#mailman
>
> What a useless tool. Instead of tiredly complaining that things suck,
> why not take some initiative to make them better?

The guy who refuses to sign the CLA taking some initiative to make
things better? Is that an entire air force of flying pigs I observe
going past my window, or merely one US atomic powered carrier's worth?

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
On Fri, May 29, 2015 at 2:05 AM, anatoly techtonik wrote:
> Added Mailman to my suxx tracker:
> https://github.com/techtonik/suxx-tracker#mailman

What a useless tool. Instead of tiredly complaining that things suck,
why not take some initiative to make them better?

I'm curious about your complaint about virtualenv. How do you envision
that "logging in" to the env would be any different from activating
it?
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
On Fri, May 29, 2015 at 11:41 AM, Laura Creighton wrote:
> In a message of Fri, 29 May 2015 11:05:07 +0300, anatoly techtonik writes:
>
>> Added Mailman to my suxx tracker:
>> https://github.com/techtonik/suxx-tracker#mailman
>
> You are damning the wrong piece of software -- this is not a problem
> with mailman; mailman doesn't care at all what software you use to
> read mail and reply to it with. The problem is with the various
> readers and repliers that people are using. In particular, people on
> the other side of the usenet -> python-list gateway may not be seeing
> this as mail at all, or sending their replies as mail.

Sounds legit. But the middle "ux" in suxx stands for user experience,
and Mailman still doesn't improve it. If Mailman could subscribe me
automatically to the thread I am starting, that would resolve all the
problems.

> But back to your original problem.
>
> I still don't understand why you need to go from some lossless
> representation of your filename, back to the original.

It just happened that the only way to get the graph out of SCons is to
print its tree representation. That worked fine until we switched from
StringIO to its unicode equivalent, io.StringIO. Dumping binary stuff in
text form is a very common and reliable way to back up and process data.
From SQL dumps to SVN dumps, all these formats are convenient to store,
transmit and process.

> You start
> with the binary version of the filename -- a series of bytes which
> turns out to be good Cyrillic text, but could be anything.

Right, good Cyrillic text in utf-8, and Python 2.x uses 'ascii', so if
Python 2.x used 'utf-8' as its default encoding, there wouldn't be an
issue. For now. But I realize that it is not enough, so I want 100%
protection from unwanted crashes and data loss, so I want to backslash
non-utf-8 bytes when converting the data to unicode.

> You store
> that as the first so many bytes of your file. If ever you need to have
> the original representation of your filename, you already have it,
> right there, by reading the first so many bytes of your file. Why
> care about what the user sees as a filename?

Not sure that I understand. I don't store anything in a file. The build
graph is a representation of filesystem structure with entries that may
or may not exist. A Node in the build graph can also be a string that is
never written to disk. When I dump the graph, I have no idea how I will
process it, but when I need to identify some Node, grep it, or find a
reference to it, I want its representation (which may as well serve as
an ID) to be preserved to avoid conflicts and wrong interpretation due
to data loss.

Hopefully now that my user story is clear, can you tell me how I can do
this bulletproof unicode conversion in Python 2? =)

--
anatoly t.
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
In a message of Fri, 29 May 2015 11:05:07 +0300, anatoly techtonik writes:
>Added Mailman to my suxx tracker:
>https://github.com/techtonik/suxx-tracker#mailman

You are damning the wrong piece of software -- this is not a problem
with mailman; mailman doesn't care at all what software you use to
read mail and reply to it with. The problem is with the various
readers and repliers that people are using. In particular, people on
the other side of the usenet -> python-list gateway may not be seeing
this as mail at all, or sending their replies as mail.

But back to your original problem.

I still don't understand why you need to go from some lossless
representation of your filename, back to the original. You start
with the binary version of the filename -- a series of bytes which
turns out to be good Cyrillic text, but could be anything. You store
that as the first so many bytes of your file. If ever you need to have
the original representation of your filename, you already have it,
right there, by reading the first so many bytes of your file. Why
care about what the user sees as a filename?

Laura
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
On Fri, May 29, 2015 at 6:05 PM, anatoly techtonik wrote:
>> On Wed, May 27, 2015 at 9:52 PM, anatoly techtonik wrote:
>>> And the short answer is that we need unicode because we are printing this
>>> information to the stdout, and stdout is opened in text mode at least on
>>> Windows, and without explicit conversion, Python will try to decode stuff
>>> as being `ascii` and fail anyway.
>>
>> So you're working with text.
>
> No. It is unknown.
>
> I am printing Nodes of SCons build graph and I don't know how Nodes are
> represented. In my case it appeared that Node contained Russian text, which
> led to crash of SCons. It could contain Russian text in cp1251 or in utf-8
> or in KOI-8 and I can't do guessing of all possible encodings there. I just
> need to print that tree without crash or information loss.

You're saying it's text, but you don't know the encoding. You're
trying to display bytes as if they're text, but fundamentally, you're
trying to work with text.

>> That means you HAVE to decode it somehow;
>> you fundamentally cannot print bytes to the console. Lossless
>> concealment of arbitrary bytes won't help you.
>
> Won't help me with what? I am debugging build scripts to find out the
> *structure* of my dependencies and then all of the sudden Python crashes
> with UnicodeDecode error leaving me pronouncing bad Russian curses
> aloud.

Your fundamental problem is not the UnicodeDecodeError, but the
unknown encoding. What you're seeing is that Python refuses to be
sloppy.

>> If you can't adequately
>> decode everything, either backslash-escape the rest, or use a
>> replacement character; you can't print out those bytes.
>
> Yes. How to backslash the rest in Python 2? In Python 3 there is
> some freaky "surrogateescape" error strategy, but what to do in
> Python 2?

Not sure what's so freaky about it. But hey. If Python 2 can't do what
you want, is it so hard to use Python 3? Unicode support really is
better.

Alternatively, just do something like this:

    b = "some arbitrary byte string that you got from somewhere"
    try:
        text = b.decode("utf-8")
    except UnicodeDecodeError:
        text = repr(b).decode("ascii")

The repr of a byte string in Py2 should be a safe way to display
arbitrary bytes, without data loss. It will expand the string
significantly (four characters for one \xNN escape, plus adding
backslashes to everything else that needs them), but it does guarantee
safety.

> Replacement character is not a solution, because it is a data loss,
> and if I want to do post processing of graph log, I won't be able to
> recover the missing bits.
>
>> And no, I will not cc you. Subscribe to the list if you're going to
>> ask a question.
>
> Added Mailman to my suxx tracker:
> https://github.com/techtonik/suxx-tracker#mailman

Why? You're trying to fire questions out to a community without being
a part of that community. Why is that the software's problem? You can
either subscribe to the list/ng or follow via some web interface, but
it's unreasonable to ask everyone to cc you.

Imagine if we _did_ all cc you, but we also cc you in on an entire
sub-thread that you're not interested in. Or maybe half of us do and
half don't. What then? You don't get any sort of control over what you
get copies of. Is that really what you want?

ChrisA
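A note on recovering the bytes from the repr suggestion: with the try/except form, a reader of the dump cannot tell which branch produced a given line. If repr is applied unconditionally, ast.literal_eval should be able to parse it back. The following is a minimal sketch under that assumption; dump_bytes and load_bytes are hypothetical helper names, not anything from the thread, and the code is meant to run on Python 2.7 and Python 3. The trade-off is that valid Cyrillic text is no longer readable in the dump.

```python
import ast


def dump_bytes(raw):
    """Return an ASCII-only text form of raw that can be parsed back."""
    text = repr(raw)  # 'abc\xff' on Python 2, b'abc\xff' on Python 3
    if isinstance(text, bytes):  # Python 2: repr() returns a byte string
        text = text.decode("ascii")
    return text


def load_bytes(text):
    """Inverse of dump_bytes: parse the literal back into bytes."""
    return ast.literal_eval(text)
```

Because every byte goes through repr, quotes, backslashes and newlines are escaped consistently, and each dumped value stays on one line, which keeps the output easy to grep.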
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
On Wed, May 27, 2015 at 3:57 PM, Laura Creighton wrote:
> --- Forwarded Message
>
> Return-Path:
> Received: from mail.python.org (mail.python.org [82.94.164.166])
>   by theraft.openend.se (8.14.4/8.14.4/Debian-4) with ESMTP id
>   t4RC09ap02
> From: Chris Angelico
> Cc: "python-list@python.org"
>
> On Wed, May 27, 2015 at 9:52 PM, anatoly techtonik wrote:
>> And the short answer is that we need unicode because we are printing this
>> information to the stdout, and stdout is opened in text mode at least on
>> Windows, and without explicit conversion, Python will try to decode stuff
>> as being `ascii` and fail anyway.
>
> So you're working with text.

No. It is unknown.

I am printing Nodes of the SCons build graph and I don't know how Nodes
are represented. In my case it appeared that a Node contained Russian
text, which led to a crash of SCons. It could contain Russian text in
cp1251 or in utf-8 or in KOI-8 and I can't guess all possible encodings
there. I just need to print that tree without a crash or information
loss.

> That means you HAVE to decode it somehow;
> you fundamentally cannot print bytes to the console. Lossless
> concealment of arbitrary bytes won't help you.

Won't help me with what? I am debugging build scripts to find out the
*structure* of my dependencies and then all of a sudden Python crashes
with a UnicodeDecode error, leaving me pronouncing bad Russian curses
aloud. It is not only less forgiving than Java, but also more
treacherous, because of its run-time nature. It would surely help to
preserve my zen if Python could just flow through the nodes of this
graph. Garbage is okay - I can clean it up or remove it if it stands in
the way, just don't disrupt my flow or tell me that I now want to deal
with UnicodeDecode errors. Because I don't.

> If you can't adequately
> decode everything, either backslash-escape the rest, or use a
> replacement character; you can't print out those bytes.

Yes. How to backslash the rest in Python 2? In Python 3 there is some
freaky "surrogateescape" error strategy, but what to do in Python 2?

Replacement character is not a solution, because it is a data loss, and
if I want to do post processing of the graph log, I won't be able to
recover the missing bits.

> And no, I will not cc you. Subscribe to the list if you're going to
> ask a question.

Added Mailman to my suxx tracker:
https://github.com/techtonik/suxx-tracker#mailman

--
anatoly t.
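For readers unfamiliar with the "surrogateescape" error handler mentioned above (PEP 383, Python 3.1 and later; it is not in Python 2's standard library), here is a short Python 3 sketch of what it does. The sample bytes are made up for illustration: one stray Latin-1 byte followed by valid UTF-8.

```python
# Python 3 only. Sample data is hypothetical: a stray 0xe9 byte, a space,
# then the valid UTF-8 encoding of the Cyrillic letter "pe" (U+043F).
raw = b"caf\xe9 \xd0\xbf"

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = raw.decode("utf-8", "surrogateescape")

# The same handler on encode restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")

# Lone surrogates generally cannot be written to a text stream as-is,
# so escape them (and any other non-ASCII) for display.
printable = text.encode("ascii", "backslashreplace").decode("ascii")
```

The decoded string is lossless with respect to the original bytes, but it is only safe to print after a step like the last one; the display form here also escapes the valid Cyrillic character, which a real tool might prefer to keep readable.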
Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)
Chris Angelico apparently has a problem with cc'd people who aren't on
the list. python-list is very quiet these days, so if you subscribe it
won't be drinking from the firehose. And you can always turn off
delivery when you are done.

Or you can just go read the archives:
https://mail.python.org/pipermail/python-list/2015-May/thread.html

Laura

--- Forwarded Message

Return-Path:
Received: from mail.python.org (mail.python.org [82.94.164.166])
  by theraft.openend.se (8.14.4/8.14.4/Debian-4) with ESMTP id
  t4RC09ap02
From: Chris Angelico
Cc: "python-list@python.org"

On Wed, May 27, 2015 at 9:52 PM, anatoly techtonik wrote:
> And the short answer is that we need unicode because we are printing this
> information to the stdout, and stdout is opened in text mode at least on
> Windows, and without explicit conversion, Python will try to decode stuff
> as being `ascii` and fail anyway.

So you're working with text. That means you HAVE to decode it somehow;
you fundamentally cannot print bytes to the console. Lossless
concealment of arbitrary bytes won't help you. If you can't adequately
decode everything, either backslash-escape the rest, or use a
replacement character; you can't print out those bytes.

And no, I will not cc you. Subscribe to the list if you're going to
ask a question.

ChrisA

--- End of Forwarded Message