Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread Ian Kelly
On Fri, May 29, 2015 at 4:44 AM, Jon Ribbens wrote:
> On 2015-05-29, Ian Kelly  wrote:
>> On Fri, May 29, 2015 at 2:05 AM, anatoly techtonik wrote:
>>> Added Mailman to my suxx tracker:
>>> https://github.com/techtonik/suxx-tracker#mailman
>>
>> What a useless tool. Instead of tiredly complaining that things suck,
>> why not take some initiative to make them better?
>>
>> I'm curious about your complaint about virtualenv. How do you envision
>> that "logging in" to the env would be any different from activating
>> it?
>
> Please Do Not Feed The Troll.

It's not a troll if the discussion is potentially useful and not just
disruptive.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread anatoly techtonik
On Fri, May 29, 2015 at 2:39 PM, Laura Creighton  wrote:
> Do you know about the codecs module?
>
> reading http://pymotw.com/2/codecs/ may be useful if this is new to you.

Does that work for Python 2 and Python 3?

> Have you read https://www.python.org/dev/peps/pep-0293/ ?

No.

> Will backslashreplace do what you want?

I don't know. I am sorry, but where is the code that
does this:

  binary -> escaped utf-8 string -> unicode -> binary

I know about the codecs module, but I am not seeing a solution
for crash-proof output from Python. Is inserting a custom codec
class into every piece of code that I want to debug the only
solution?
-- 
anatoly t.
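
One way to get the round trip asked about above without writing a full
codec class is a custom decode error handler registered through
codecs.register_error (the mechanism PEP 293 describes). The following is
only a sketch under assumptions: the handler name "escape_bad_bytes" and
the helpers to_text and to_bytes are made up for illustration, and the
code was written against Python 3 (the same calls exist in Python 2, but
that path is untested here). Literal backslashes are doubled first so that
the escapes stay unambiguous.

```python
import codecs
import re


def _escape_bad_bytes(exc):
    # Hypothetical handler: turn each undecodable byte into a \xNN escape.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = bytearray(exc.object[exc.start:exc.end])
    return (u"".join(u"\\x%02x" % b for b in bad), exc.end)


codecs.register_error("escape_bad_bytes", _escape_bad_bytes)

# Matches either a doubled backslash or a \xNN escape produced above.
_TOKEN = re.compile(r"\\(\\|x[0-9a-f]{2})")


def to_text(raw):
    # Double real backslashes so they cannot be mistaken for escapes,
    # then decode as UTF-8, escaping whatever does not decode.
    return raw.replace(b"\\", b"\\\\").decode("utf-8", "escape_bad_bytes")


def to_bytes(text):
    # Reverse of to_text: re-encode plain runs, unescape the tokens.
    out = bytearray()
    pos = 0
    for match in _TOKEN.finditer(text):
        out += text[pos:match.start()].encode("utf-8")
        token = match.group(1)
        if token == "\\":
            out += b"\\"
        else:
            out.append(int(token[1:], 16))
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


raw = b"\xd0\xbf\xd1\x80 \xff\xfe C:\\tmp"
text = to_text(raw)
assert to_bytes(text) == raw
```

The handler only needs to be registered once per process; after that any
decode call can name it, so no per-call-site codec class is required.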


Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread Laura Creighton
Do you know about the codecs module?

reading http://pymotw.com/2/codecs/ may be useful if this is new to you.

Have you read https://www.python.org/dev/peps/pep-0293/ ?

Will backslashreplace do what you want?

Laura
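
For reference, a small sketch of what the built-in backslashreplace
handler does. On the encoding side (unicode -> bytes) it is available in
both Python 2 and Python 3; for decoding it is, as far as I know, only
accepted from Python 3.5 on, which is why it does not answer the Python 2
question directly. The sample strings are illustrative.

```python
import sys

# Encoding: characters that do not fit the target charset become escapes.
text = u"\u043f\u0440\u0438\u0432\u0435\u0442"  # Russian for "hello"
escaped = text.encode("ascii", "backslashreplace")

# Decoding: undecodable bytes become \xNN escapes (Python 3.5+ only).
decoded = None
if sys.version_info >= (3, 5):
    decoded = b"ok \xff".decode("utf-8", "backslashreplace")
```

Note that the decoded form is not reversible on its own: a literal
backslash in the input looks the same as one produced by the handler.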



Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread Jon Ribbens
On 2015-05-29, Ian Kelly  wrote:
> On Fri, May 29, 2015 at 2:05 AM, anatoly techtonik wrote:
>> Added Mailman to my suxx tracker:
>> https://github.com/techtonik/suxx-tracker#mailman
>
> What a useless tool. Instead of tiredly complaining that things suck,
> why not take some initiative to make them better?
>
> I'm curious about your complaint about virtualenv. How do you envision
> that "logging in" to the env would be any different from activating
> it?

Please Do Not Feed The Troll.


Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread Mark Lawrence

On 29/05/2015 11:02, Ian Kelly wrote:
> On Fri, May 29, 2015 at 2:05 AM, anatoly techtonik wrote:
>> Added Mailman to my suxx tracker:
>> https://github.com/techtonik/suxx-tracker#mailman
>
> What a useless tool. Instead of tiredly complaining that things suck,
> why not take some initiative to make them better?

The guy who refuses to sign the CLA taking some initiative to make
things better?  Is that an entire air force of flying pigs I observe
going past my window, or merely one US atomic powered carrier's worth?

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence



Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread Ian Kelly
On Fri, May 29, 2015 at 2:05 AM, anatoly techtonik  wrote:
> Added Mailman to my suxx tracker:
> https://github.com/techtonik/suxx-tracker#mailman

What a useless tool. Instead of tiredly complaining that things suck,
why not take some initiative to make them better?

I'm curious about your complaint about virtualenv. How do you envision
that "logging in" to the env would be any different from activating
it?


Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread anatoly techtonik
On Fri, May 29, 2015 at 11:41 AM, Laura Creighton  wrote:
> In a message of Fri, 29 May 2015 11:05:07 +0300, anatoly techtonik writes:
>
>>Added Mailman to my suxx tracker:
>>https://github.com/techtonik/suxx-tracker#mailman
>
> You are damning the wrong piece of software -- this is not a problem
> with mailman; mailman doesn't care at all what software you use to
> read mail and reply to it with.  The problem is with the various
> readers and repliers that people are using.  In particular, people on
> the other side of the usenet -> python-list gateway may not be seeing
> this as mail at all, or sending their replies as mail.

Sounds legit. But middle ux in suxx stands for user experience,
and Mailman still doesn't improve it. If Mailman could subscribe
me automatically to the thread I am starting, that would resolve
all the problems.

> But back to your original problem.
>
> I still don't understand why you need to go from some lossless
> representation of your filename, back to the original.

It just happened that the only way to get the graph out of SCons
is to print its tree representation. That worked fine until we
switched from StringIO to its unicode equivalent, io.StringIO.
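
The failure mode described here can be reproduced in a few lines. This is
a sketch, not the SCons code: the byte string stands in for a hypothetical
node name, and io.StringIO is simply shown rejecting bytes until they are
decoded explicitly.

```python
import io

buf = io.StringIO()
raw = b"\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82"  # UTF-8 bytes

# io.StringIO only accepts unicode text; bytes raise TypeError.
try:
    buf.write(raw)
    rejected = False
except TypeError:
    rejected = True

# An explicit decode is needed before writing.
buf.write(raw.decode("utf-8"))
```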

Dumping binary stuff in text form is a very common and reliable
way to backup and process data. Starting from SQL dumps to
SVN dumps - all these formats are convenient to store, transmit
and process.

> You start
> with the binary version of the filename  -- a series of bytes which
> turns out to be good Cyrillic text, but could be anything.

Right, good Cyrillic text in utf-8, and Python 2.x uses 'ascii', so if
Python 2.x used 'utf-8' as its default encoding, there wouldn't be an
issue. For now. But I realize that it is not enough, so I want 100%
protection from unwanted crashes and data loss, so I want to
backslash non-utf-8 bytes when converting the data to unicode.

> You store
> that as the first so many bytes of your file. If ever you need to have
> the original representation of your filename, you already have it,
> right there, by reading the first so many bytes of your file.  Why
> care about what the user sees as a filename?

Not sure that I understand. I don't store anything in a file. The build graph
is a representation of the filesystem structure with entries that may or
may not exist. A Node in the build graph can also be a string that is never
written to disk. When I dump the graph, I have no idea how I will
process it, but when I need to identify some Node, grep for it, or find
a reference to it, I want its representation (which may as well serve
as an ID) to be preserved to avoid conflicts and wrong interpretation
due to data loss.

Now that my user story is hopefully clear, can you tell me how I can
do this bulletproof unicode conversion in Python 2? =)
-- 
anatoly t.


Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread Laura Creighton
In a message of Fri, 29 May 2015 11:05:07 +0300, anatoly techtonik writes:

>Added Mailman to my suxx tracker:
>https://github.com/techtonik/suxx-tracker#mailman

You are damning the wrong piece of software -- this is not a problem
with mailman; mailman doesn't care at all what software you use to
read mail and reply to it with.  The problem is with the various
readers and repliers that people are using.  In particular, people on
the other side of the usenet -> python-list gateway may not be seeing
this as mail at all, or sending their replies as mail.

But back to your original problem.

I still don't understand why you need to go from some lossless
representation of your filename, back to the original.  You start
with the binary version of the filename  -- a series of bytes which
turns out to be good Cyrillic text, but could be anything.  You store
that as the first so many bytes of your file. If ever you need to have
the original representation of your filename, you already have it,
right there, by reading the first so many bytes of your file.  Why
care about what the user sees as a filename?

Laura




Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread Chris Angelico
On Fri, May 29, 2015 at 6:05 PM, anatoly techtonik  wrote:
>> On Wed, May 27, 2015 at 9:52 PM, anatoly techtonik wrote:
>>> And the short answer is that we need unicode because we are printing this
>>> information to the stdout, and stdout is opened in text mode at least on
>>> Windows, and without explicit conversion, Python will try to decode stuff
>>> as being `ascii` and fail anyway.
>>
>> So you're working with text.
>
> No. It is unknown.
>
> I am printing Nodes of SCons build graph and I don't know how Nodes are
> represented. In my case it appeared that Node contained Russian text, which
> led to crash of SCons. It could contain Russian text in cp1251 or in utf-8 or in
> KOI-8 and I can't do guessing of all possible encodings there. I just need to
> print that tree without crash or information loss.

You're saying it's text, but you don't know the encoding. You're
trying to display bytes as if they're text, but fundamentally, you're
trying to work with text.

>> That means you HAVE to decode it somehow;
>> you fundamentally cannot print bytes to the console. Lossless
>> concealment of arbitrary bytes won't help you.
>
> Won't help me with what? I am debugging build scripts to find out the
> *structure* of my dependencies and then all of the sudden Python crashes
> with UnicodeDecode error leaving me pronouncing bad Russian curses
> aloud.

Your fundamental problem is not the UnicodeDecodeError, but the
unknown encoding. What you're seeing is that Python refuses to be
sloppy.
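
To illustrate the point about the unknown encoding: the same bytes are
valid in several Cyrillic encodings and mean something different in each,
so no decoder can pick the right one from the bytes alone. A sketch with
an illustrative byte string (the word chosen is an assumption made for the
example):

```python
raw = b"\xcf\xf0\xe8\xe2\xe5\xf2"  # Russian for "hello", encoded as cp1251

as_cp1251 = raw.decode("cp1251")
as_koi8 = raw.decode("koi8-r")  # also decodes, but to different text

# The same bytes are not valid UTF-8 at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# latin-1 maps every byte to one code point, so it never fails and can be
# reversed, although the result is not readable as Russian.
as_latin1 = raw.decode("latin-1")
```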

>> If you can't adequately
>> decode everything, either backslash-escape the rest, or use a
>> replacement character; you can't print out those bytes.
>
> Yes. How to backslash the rest in Python 2? In Python 3 there is
> some freaky "surrogateescape" error strategy, but what to do in
> Python 2?

Not sure what's so freaky about it. But hey. If Python 2 can't do what
you want, is it so hard to use Python 3? Unicode support really is
better. Alternatively, just do something like this:

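b = "some arbitrary byte string that you got from somewhere"
try:
    # Valid UTF-8 decodes cleanly to unicode.
    text = b.decode("utf-8")
except UnicodeDecodeError:
    # Fall back to the ASCII-only repr, which backslash-escapes the bytes.
    text = repr(b).decode("ascii")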

The repr of a byte string in Py2 should be a safe way to display
arbitrary bytes, without data loss. It will expand the string
significantly (four characters for one \xNN escape, plus adding
backslashes to everything else that needs them), but it does guarantee
safety.
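
A sketch of why repr is safe here and how the escaped form can be turned
back into bytes. It assumes repr is applied to every string, not only to
the ones that fail to decode, so that the output can be reversed
unambiguously; ast.literal_eval is used for the reverse step. The sample
bytes are made up.

```python
import ast

raw = b"\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82 \xff 'quoted' \\"

# repr yields an ASCII-only string literal with \xNN escapes.
shown = repr(raw)

# literal_eval parses that literal back into the original bytes.
restored = ast.literal_eval(shown)
assert restored == raw
```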

> Replacement character is not a solution, because it is a data loss,
> and if I want to do post processing of graph log, I won't be able to
> recover the missing bits.
>
>> And no, I will not cc you. Subscribe to the list if you're going to
>> ask a question.
>
> Added Mailman to my suxx tracker:
> https://github.com/techtonik/suxx-tracker#mailman

Why? You're trying to fire questions out to a community without being
a part of that community. Why is that the software's problem?

You can either subscribe to the list/ng or follow via some web
interface, but it's unreasonable to ask everyone to cc you. Imagine if
we _did_ all cc you, but we also cc you in on an entire sub-thread
that you're not interested in. Or maybe half of us do and half don't.
What then? You don't get any sort of control over what you get copies
of. Is that really what you want?

ChrisA


Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-29 Thread anatoly techtonik
On Wed, May 27, 2015 at 3:57 PM, Laura Creighton  wrote:
> --- Forwarded Message
>
> Return-Path: 
> Received: from mail.python.org (mail.python.org [82.94.164.166])
> by theraft.openend.se (8.14.4/8.14.4/Debian-4) with ESMTP id t4RC09ap02
> From: Chris Angelico
> Cc: "python-list@python.org" 
>
>
> On Wed, May 27, 2015 at 9:52 PM, anatoly techtonik wrote:
>> And the short answer is that we need unicode because we are printing this
>> information to the stdout, and stdout is opened in text mode at least on
>> Windows, and without explicit conversion, Python will try to decode stuff
>> as being `ascii` and fail anyway.
>
> So you're working with text.

No. It is unknown.

I am printing Nodes of SCons build graph and I don't know how Nodes are
represented. In my case it appeared that Node contained Russian text, which
led to crash of SCons. It could contain Russian text in cp1251 or in utf-8 or in
KOI-8 and I can't do guessing of all possible encodings there. I just need to
print that tree without crash or information loss.

> That means you HAVE to decode it somehow;
> you fundamentally cannot print bytes to the console. Lossless
> concealment of arbitrary bytes won't help you.

Won't help me with what? I am debugging build scripts to find out the
*structure* of my dependencies and then all of the sudden Python crashes
with UnicodeDecode error leaving me pronouncing bad Russian curses
aloud.

It is not only less forgiving than Java, but is also more treacherous,
because of its run-time nature.

It would surely help to preserve my zen if Python could just flow through
the nodes of this graph. Garbage is okay - I can clean it up or remove it if it
stands in the way, just don't disrupt my flow or tell me that now I want to deal
with UnicodeDecode errors. Because I don't.

> If you can't adequately
> decode everything, either backslash-escape the rest, or use a
> replacement character; you can't print out those bytes.

Yes. How to backslash the rest in Python 2? In Python 3 there is
some freaky "surrogateescape" error strategy, but what to do in
Python 2?
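
For comparison, this is roughly how the surrogateescape handler behaves in
Python 3 (it is not available as a built-in handler in Python 2).
Undecodable bytes are carried through as lone surrogates and restored when
encoding with the same handler. The sample bytes are illustrative:

```python
raw = b"\xd0\xbf\xd1\x80 \xff\xfe"  # valid UTF-8 followed by two stray bytes

# Stray bytes 0xff and 0xfe become the lone surrogates U+DCFF and U+DCFE.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")

# For display, everything non-ASCII can be backslash-escaped.
printable = text.encode("ascii", "backslashreplace").decode("ascii")
```

The text itself cannot be printed directly, because lone surrogates are
not encodable with the default strict handler; hence the extra
backslashreplace step for output.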

Replacement character is not a solution, because it is a data loss,
and if I want to do post processing of graph log, I won't be able to
recover the missing bits.

> And no, I will not cc you. Subscribe to the list if you're going to
> ask a question.

Added Mailman to my suxx tracker:
https://github.com/techtonik/suxx-tracker#mailman

-- 
anatoly t.


Re: Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

2015-05-27 Thread Laura Creighton
Chris Angelico apparently has a problem with cc'd people who aren't
on the list.  python-list is very quiet these days, so if you
subscribe it won't be drinking from the firehose.  And you can
always turn off delivery when you are done.  Or you can just
go read the archives: 
https://mail.python.org/pipermail/python-list/2015-May/thread.html

Laura

--- Forwarded Message

Return-Path: 
Received: from mail.python.org (mail.python.org [82.94.164.166])
by theraft.openend.se (8.14.4/8.14.4/Debian-4) with ESMTP id t4RC09ap02
From: Chris Angelico
Cc: "python-list@python.org" 


On Wed, May 27, 2015 at 9:52 PM, anatoly techtonik  wrote:
> And the short answer is that we need unicode because we are printing this
> information to the stdout, and stdout is opened in text mode at least on
> Windows, and without explicit conversion, Python will try to decode stuff
> as being `ascii` and fail anyway.

So you're working with text. That means you HAVE to decode it somehow;
you fundamentally cannot print bytes to the console. Lossless
concealment of arbitrary bytes won't help you. If you can't adequately
decode everything, either backslash-escape the rest, or use a
replacement character; you can't print out those bytes.

And no, I will not cc you. Subscribe to the list if you're going to
ask a question.

ChrisA
- -- 
https://mail.python.org/mailman/listinfo/python-list

--- End of Forwarded Message