Re: Another 2 to 3 mail encoding problem

2020-08-31 Thread Peter J. Holzer
On 2020-08-27 09:34:47 +0100, Chris Green wrote:
> Peter J. Holzer  wrote:
> > The problem is that the message contains a '\ufeff' character (byte
> > order mark) where email/generator.py expects only ASCII characters.
> > 
> > I see two possible reasons for this:
[...]
> > Both reasons are weird.
[...]
> > But then you haven't shown where msg comes from. How do you parse the
> > message to get "msg"?
> > 
> > Can you construct a minimal test message which triggers the bug?
> > 
> Yes, simply sending myself an E-Mail with (for example) accented
> characters triggers the error.

Ok. So it's not a specific message, but any mail with accented
characters.

Since Python's mailbox module handles mails with accented characters
just fine (I've processed thousands of mails with it), the bug is almost
certainly in your program. And, as I explained above, almost certainly
in the part which you didn't show us.

Can you reduce your program to the minimum which still triggers the bug
and post the result here?

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread MRAB

On 2020-08-27 17:29, Barry Scott wrote:
>> On 26 Aug 2020, at 16:10, Chris Green  wrote:
>>
>>  UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in
>> position 4: ordinal not in range(128)
>>
>> So what do I need to do to the message I'm adding with mbx.add(msg) to
>> fix this?  (I assume that's what I need to do).
>
> >>> import unicodedata
> >>> unicodedata.name('\ufeff')
> 'ZERO WIDTH NO-BREAK SPACE'
>
> I guess the editor you use to compose the text is adding that to your message.

That's used as a BOM (byte-order mark) at the start of UTF-16 (and UTF-32) text.
It's also used at the start of UTF-8-SIG.
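
As a quick check of the point about BOMs, here is a minimal sketch using only the standard library (the sample string "hi" is made up for illustration). It shows where U+FEFF turns up in the UTF-16 and UTF-8-SIG codecs, and that a plain UTF-8 decode leaves it in the text as an ordinary character:

```python
import codecs
import unicodedata

# U+FEFF is named ZERO WIDTH NO-BREAK SPACE but doubles as the byte-order mark.
bom = "\ufeff"
name = unicodedata.name(bom)

# utf-8-sig writes the UTF-8 encoding of U+FEFF as a three-byte signature.
utf8_sig = "hi".encode("utf-8-sig")

# The generic utf-16 codec writes a BOM in native byte order;
# the explicit utf-16-be codec writes none.
utf16 = "hi".encode("utf-16")
utf16_be = "hi".encode("utf-16-be")

# Decoding with plain utf-8 keeps the BOM as a character; utf-8-sig strips it.
kept = utf8_sig.decode("utf-8")
stripped = utf8_sig.decode("utf-8-sig")

print(name, repr(kept), repr(stripped))
```

If a file or pipe is decoded as plain UTF-8 somewhere upstream, a leading BOM would survive as the first character of the resulting str, which is one plausible way for it to end up inside a message.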


Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Barry Scott



> On 26 Aug 2020, at 16:10, Chris Green  wrote:
> 
>  UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in 
> position 4: ordinal not in range(128)
> 
> So what do I need to do to the message I'm adding with mbx.add(msg) to
> fix this?  (I assume that's what I need to do).

>>> import unicodedata
>>> unicodedata.name('\ufeff')
'ZERO WIDTH NO-BREAK SPACE'

I guess the editor you use to compose the text is adding that to your message.

Barry



Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Barry


> On 27 Aug 2020, at 10:40, Chris Green  wrote:
> 
> Karsten Hilbert  wrote:
>>> Terry Reedy  wrote:
>>>> On 8/26/2020 11:10 AM, Chris Green wrote:
>>>>
>>>>> I have a simple[ish] local mbox mail delivery module as follows:-
>>>> ...
>>>>> It has run faultlessly for many years under Python 2.  I've now
>>>>> changed the calling program to Python 3 and while it handles most
>>>>> E-Mail OK I have just got the following error:-
>>>>>
>>>>> Traceback (most recent call last):
>>>>>   File "/home/chris/.mutt/bin/filter.py", line 102, in 
>>>>> mailLib.deliverMboxMsg(dest, msg, log)
>>>> ...
>>>>>   File "/usr/lib/python3.8/email/generator.py", line 406, in write
>>>>> self._fp.write(s.encode('ascii', 'surrogateescape'))
>>>>> UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in
>>>>> position 4: ordinal not in range(128)

I would guess the fix is to do s.encode('utf-8').

You might need to add a header to say that you are using utf-8 to the 
email/mime-part.

If you do that does your code work?

Barry


>>>>
>>>> '\ufeff' is the Unicode byte-order mark.  It should not be present in an
>>>> ascii-only 3.x string and would not normally be present in general
>>>> unicode except in messages like this that talk about it.  Read about it,
>>>> for instance, at
>>>> https://en.wikipedia.org/wiki/Byte_order_mark
>>>>
>>>> I would catch the error and print part or all of string s to see what is
>>>> going on with this particular message.  Does it have other non-ascii
>>>> chars?
> 
>>> I can provoke the error simply by sending myself an E-Mail with
>>> accented characters in it.  I'm pretty sure my Linux system is set up
>>> correctly for UTF8 characters, I certainly seem to be able to send and
>>> receive these to others and I even get to see messages in other
>>> scripts such as arabic, chinese, etc.
>>> 
>>> The code above works perfectly in Python 2 delivering messages with
>>> accented (and other extended) characters with no problems at all.
>>> Sending myself E-Mails with accented characters works OK with the code
>>> running under Python 2.
>>> 
>>> While an E-Mail body possibly *shouldn't* have non-ASCII characters in
>>> it one must be able to handle them without errors.  In fact haven't
>>> the RFCs changed such that the message body should be 8-bit clean?
>>> Anyway I think the Python 3 mail handling libraries need to be able to
>>> pass extended characters through without errors.
>> 
>> Well, '\ufeff' is not a *character* at all in much of any
>> sense of that word in unicode.
>> 
>> It's a marker. Whatever puts it into the stream is wrong. I guess the
>> best one can (and should) do is to catch the exception and dump
>> the offending stream somewhere binary-capable and pass on a notice. What
>> you are receiving there very much isn't a (well-formed) e-mail message.
>> 
>> I would then attempt to backwards-crawl the delivery chain to
>> find out where it came from.
>> 
> The error seems to occur with any non-7-bit-ASCII, e.g. my accented
> characters gave:-
> 
>  File "/usr/lib/python3.8/email/generator.py", line 406, in write
>  self._fp.write(s.encode('ascii', 'surrogateescape'))
>  UnicodeEncodeError: 'ascii' codec can't encode character
>  '\u2019' in position 34: ordinal not in
>   range(128)
> 
> It just happened that the first example was an escape.
> 
> -- 
> Chris Green


Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Chris Green
Richard Damon  wrote:
> On 8/27/20 4:31 AM, Chris Green wrote:
> > While an E-Mail body possibly *shouldn't* have non-ASCII characters in
> > it one must be able to handle them without errors.  In fact haven't
> > the RFCs changed such that the message body should be 8-bit clean?
> > Anyway I think the Python 3 mail handling libraries need to be able to
> > pass extended characters through without errors.
> 
> Email messages are fully allowed to use non-ASCII characters as long
> as the headers indicate this. They can be encoded either as raw 8-bit
> bytes on systems that are 8-bit clean, or, for systems that are not,
> they will need to be encoded either as base64 or using quoted-printable
> encoding. These characters are to be interpreted in the character set
> defined (or presumed) in the header; the body can even be some other
> binary object like an image or executable if the content type isn't text.
> 
> Because of this, the Python 3 str type is not suitable to store an email
> message, since it insists on the string being Unicode encoded, but the
> Python 2 str class could hold it.
> 
Which sounds like the core of my problem[s]! :-)

As I said my system (ignoring the Python issues) is all UTF8 and all
seems to work well so I think it's pretty much correctly configured.
When I send mail that has accented and other extended characters in it
the E-Mail headers have:-
Content-Type: text/plain; charset=utf-8

If I save a message like the above sent to myself it's stored using
the UTF8 characters directly, I can open it with my text editor (which
is also UTF8 aware) and see the characters as I entered them, there's
no encoding because my system is 8-bit clean and I'm talking to myself
as it were.

The above is using Python 2 to handle and filter my incoming mail
which, as you say, works fine.  However when I try switching to Python
3 I get the errors I've been asking about, even though this is
'talking to myself' and the E-Mail message is just UTF8.


-- 
Chris Green
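
To illustrate the situation described above, here is a small sketch that parses a message of that shape from bytes and reads the charset from its headers. The addresses and the `raw` message are invented for the example; only the `Content-Type` line is taken from the post.

```python
import email

# Hypothetical message: ASCII headers, body stored as raw UTF-8 (8bit).
raw = (
    b"From: sender@example.com\n"
    b"To: me@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + "àéçł\n".encode("utf-8")

msg = email.message_from_bytes(raw)

charset = msg.get_content_charset()        # taken from the Content-Type header
body_bytes = msg.get_payload(decode=True)  # body as bytes, transfer encoding undone
body_text = body_bytes.decode(charset)

print(charset, body_text)
```

The point of the sketch is that the bytes on disk and the charset declared in the header together determine the text; nothing needs to be decoded before the message is parsed.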


Aw: Re: Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Karsten Hilbert
> > > Because of this, the Python 3 str type is not suitable to store an email
> > > message, since it insists on the string being Unicode encoded,
> >
> > I should greatly appreciate to be enlightened as to what
> > a "string being Unicode encoded" is intended to say ?
> >
>
> A Python 3 "str" or a Python 2 "unicode" is an abstract sequence of
> Unicode codepoints.

OK, I figured that much. So it was the "encoded" that threw me off.

Being a sequence of Unicode codepoints makes it en-Uni-coded at
a technically abstract level while I assumed the "encoded" is meant
to somehow reference ''.encode() and friends.

Karsten



Re: Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Chris Angelico
On Thu, Aug 27, 2020 at 11:10 PM Karsten Hilbert
 wrote:
>
> > Because of this, the Python 3 str type is not suitable to store an email
> > message, since it insists on the string being Unicode encoded,
>
> I should greatly appreciate to be enlightened as to what
> a "string being Unicode encoded" is intended to say ?
>

A Python 3 "str" or a Python 2 "unicode" is an abstract sequence of
Unicode codepoints. As such, it's not suitable for transparently
round-tripping an email, as it would lose information about the way
that things were encoded. However, it is excellent for building and
processing emails - you deal with character encodings at the same
point where you deal with the RFC 822 header format. In the abstract,
your headers might be stored in a dict, but then you encode them to a
flat sequence of bytes by putting "Header: value", wrapping correctly
- and also encode the text into bytes at the same time.

ChrisA
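
A minimal sketch of the two halves of that argument, using a made-up sample word: first, that a str does not remember which byte encoding it came from; second, that when building a message the encoding is picked at serialization time (here for a header, via `email.header.Header`).

```python
from email.header import Header

text = "café"

# Two different byte encodings of the same text...
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")

# ...decode to the identical str, so the str no longer records which was used.
same_text = as_latin1.decode("latin-1") == as_utf8.decode("utf-8")

# When building a message, the encoding is chosen when the header is serialized.
encoded_subject = Header(text, "utf-8").encode()  # RFC 2047 encoded word

print(as_latin1, as_utf8, same_text, encoded_subject)
```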


Aw: Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Karsten Hilbert
> Because of this, the Python 3 str type is not suitable to store an email
> message, since it insists on the string being Unicode encoded,

I should greatly appreciate to be enlightened as to what
a "string being Unicode encoded" is intended to say ?

Thanks,
Karsten


Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Richard Damon
On 8/27/20 4:31 AM, Chris Green wrote:
> While an E-Mail body possibly *shouldn't* have non-ASCII characters in
> it one must be able to handle them without errors.  In fact haven't
> the RFCs changed such that the message body should be 8-bit clean?
> Anyway I think the Python 3 mail handling libraries need to be able to
> pass extended characters through without errors.

Email messages are fully allowed to use non-ASCII characters as long
as the headers indicate this. They can be encoded either as raw 8-bit
bytes on systems that are 8-bit clean, or, for systems that are not,
they will need to be encoded either as base64 or using quoted-printable
encoding. These characters are to be interpreted in the character set
defined (or presumed) in the header; the body can even be some other
binary object like an image or executable if the content type isn't text.

Because of this, the Python 3 str type is not suitable to store an email
message, since it insists on the string being Unicode encoded, but the
Python 2 str class could hold it.

-- 
Richard Damon
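
The three body encodings mentioned above can be compared side by side. This is only a sketch: it uses `email.message.EmailMessage.set_content()` with its `cte` argument, and the helper name `build` and the sample text are invented for the example.

```python
from email.message import EmailMessage

text = "àéçł"

def build(cte):
    """Build a text/plain; charset=utf-8 message with the requested
    Content-Transfer-Encoding ("8bit", "base64" or "quoted-printable")."""
    msg = EmailMessage()
    msg["Subject"] = "test"
    msg.set_content(text, charset="utf-8", cte=cte)
    return msg

raw_8bit = build("8bit").as_bytes()             # body bytes written as-is
raw_b64 = build("base64").as_bytes()            # body wrapped in base64
raw_qp = build("quoted-printable").as_bytes()   # body as =XX escapes

for raw in (raw_8bit, raw_b64, raw_qp):
    print(raw.decode("utf-8"))
    print("-" * 40)
```

Only the 8bit variant contains bytes above 127; the other two are pure ASCII on the wire and rely on the headers to say how to get the text back.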



Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Chris Green
Karsten Hilbert  wrote:
> > Terry Reedy  wrote:
> > > On 8/26/2020 11:10 AM, Chris Green wrote:
> > >
> > > > I have a simple[ish] local mbox mail delivery module as follows:-
> > > ...
> > > > It has run faultlessly for many years under Python 2.  I've now
> > > > changed the calling program to Python 3 and while it handles most
> > > > E-Mail OK I have just got the following error:-
> > > >
> > > >  Traceback (most recent call last):
> > > >File "/home/chris/.mutt/bin/filter.py", line 102, in 
> > > >  mailLib.deliverMboxMsg(dest, msg, log)
> > > ...
> > > >File "/usr/lib/python3.8/email/generator.py", line 406, in write
> > > >  self._fp.write(s.encode('ascii', 'surrogateescape'))
> > > > UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in
> > > position 4: ordinal not in range(128)
> > >
> > > '\ufeff' is the Unicode byte-order mark.  It should not be present in an
> > > ascii-only 3.x string and would not normally be present in general
> > > unicode except in messages like this that talk about it.  Read about it,
> > > for instance, at
> > > https://en.wikipedia.org/wiki/Byte_order_mark
> > >
> > > I would catch the error and print part or all of string s to see what is
> > > going on with this particular message.  Does it have other non-ascii 
> > > chars?
> > >
> > I can provoke the error simply by sending myself an E-Mail with
> > accented characters in it.  I'm pretty sure my Linux system is set up
> > correctly for UTF8 characters, I certainly seem to be able to send and
> > receive these to others and I even get to see messages in other
> > scripts such as arabic, chinese, etc.
> >
> > The code above works perfectly in Python 2 delivering messages with
> > accented (and other extended) characters with no problems at all.
> > Sending myself E-Mails with accented characters works OK with the code
> > running under Python 2.
> >
> > While an E-Mail body possibly *shouldn't* have non-ASCII characters in
> > it one must be able to handle them without errors.  In fact haven't
> > the RFCs changed such that the message body should be 8-bit clean?
> > Anyway I think the Python 3 mail handling libraries need to be able to
> > pass extended characters through without errors.
> 
> Well, '\ufeff' is not a *character* at all in much of any
> sense of that word in unicode.
> 
> It's a marker. Whatever puts it into the stream is wrong. I guess the
> best one can (and should) do is to catch the exception and dump
> the offending stream somewhere binary-capable and pass on a notice. What
> you are receiving there very much isn't a (well-formed) e-mail message.
> 
> I would then attempt to backwards-crawl the delivery chain to
> find out where it came from.
> 
The error seems to occur with any non-7-bit-ASCII, e.g. my accented
characters gave:-

  File "/usr/lib/python3.8/email/generator.py", line 406, in write
  self._fp.write(s.encode('ascii', 'surrogateescape'))
  UnicodeEncodeError: 'ascii' codec can't encode character
  '\u2019' in position 34: ordinal not in
   range(128)

It just happened that the first example was an escape.

-- 
Chris Green
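
The failing line in the traceback, `s.encode('ascii', 'surrogateescape')`, can be reproduced in isolation. The sketch below suggests why any real non-ASCII character trips it: the `surrogateescape` handler only reverses bytes that were smuggled into a str by a matching decode, which is what happens when a message is parsed from bytes rather than from already-decoded text.

```python
# A real non-ASCII character cannot be encoded to ASCII, with or without
# the surrogateescape error handler.
try:
    "\u2019".encode("ascii", "surrogateescape")
    failed = False
except UnicodeEncodeError:
    failed = True

# surrogateescape only round-trips bytes that were *decoded* with it:
raw = "\u2019".encode("utf-8")                         # b'\xe2\x80\x99'
escaped = raw.decode("ascii", "surrogateescape")       # lone surrogates U+DCxx
restored = escaped.encode("ascii", "surrogateescape")  # original bytes again

print(failed, repr(escaped), restored)
```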


Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Cameron Simpson
On 27Aug2020 09:31, Chris Green  wrote:
>I can provoke the error simply by sending myself an E-Mail with
>accented characters in it.  I'm pretty sure my Linux system is set up
>correctly for UTF8 characters, I certainly seem to be able to send and
>receive these to others and I even get to see messages in other
>scripts such as arabic, chinese, etc.

See:


https://docs.python.org/3/library/email.generator.html#module-email.generator

While it conservatively writes ASCII (and email has extensive support 
for encoding other character sets into ASCII), you might profit by 
looking at the BytesGenerator in that module using the policy parameter, 
which looks like it tunes the behaviour of the flatten method.

I have a mailfiler of my own, which copes just fine.

It loads messages with email.parser.Parser, whose .parse() method 
returns a Message, and Message.as_string() seems to write happily into a 
text file for me. I run _all_ my messages through this stuff.

Cheers,
Cameron Simpson 
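
A rough sketch of the BytesGenerator suggestion, under the assumption that the message is parsed from bytes. The sample message and the helper name `flatten` are invented; `cte_type` is the policy setting that the email.generator documentation describes as controlling whether 8-bit bodies are kept or re-encoded.

```python
import io
from email import policy
from email.generator import BytesGenerator
from email.parser import BytesParser

# Hypothetical 8bit UTF-8 message.
raw = (
    b"From: sender@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + "àéçł\n".encode("utf-8")

msg = BytesParser().parsebytes(raw)

def flatten(message, pol):
    """Serialize message to bytes using the given policy."""
    buf = io.BytesIO()
    BytesGenerator(buf, mangle_from_=False, policy=pol).flatten(message)
    return buf.getvalue()

# cte_type="8bit" (the default) keeps the body bytes as they were parsed.
eight = flatten(msg, policy.compat32)

# cte_type="7bit" asks the generator to re-encode the body into ASCII.
seven = flatten(msg, policy.compat32.clone(cte_type="7bit"))

print(eight.decode("utf-8"))
print(seven.decode("ascii"))
```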


Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Peter Otten
Chris Green wrote:

> To add a little to this, the problem is definitely when I receive a
> message with UTF8 (or at least non-ASCII) characters in it.  My code
> is basically very simple, the main program reads an E-Mail message
> received from .forward on its standard input and makes it into an mbox
> message as follows:-
> 
> msg = mailbox.mboxMessage(sys.stdin.read())
> 
> it then does various tests (but doesn't change msg at all) and at the
> end delivers the message to my local mbox with:-
> 
> mbx.add(msg)
> 
> where mbx is an instance of mailbox.mbox.
> 
> 
> So, how is one supposed to handle this, should I encode the incoming
> message somewhere?
> 

This is what I'd try. Or just read the raw bytes:

data = sys.stdin.detach().read()
msg = mailbox.mboxMessage(data)
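
The raw-bytes approach can be tried without touching a real mailbox. The following is a sketch only: the message, the temporary paths and the helper name `deliver` are made up, and `sys.stdin.buffer.read()` would be an alternative to `sys.stdin.detach().read()` for getting the bytes in the real filter. It delivers the same message once as bytes and once as decoded text, and records what happens in each case.

```python
import mailbox
import os
import tempfile

# Hypothetical incoming message, as the bytes a .forward pipe would deliver.
raw = (
    b"From: sender@example.com\n"
    b"To: me@example.com\n"
    b"Subject: test\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + "àéçł\n".encode("utf-8")

def deliver(data, path):
    """Append one message (bytes or str) to the mbox at path; return it as stored."""
    mbx = mailbox.mbox(path)
    try:
        key = mbx.add(mailbox.mboxMessage(data))
        mbx.flush()
        return mbx.get_bytes(key)
    finally:
        mbx.close()

with tempfile.TemporaryDirectory() as tmp:
    # Bytes in, bytes out: the UTF-8 body is stored unchanged.
    stored = deliver(raw, os.path.join(tmp, "inbox"))

    # The same message decoded to str first, as sys.stdin.read() would do.
    try:
        deliver(raw.decode("utf-8"), os.path.join(tmp, "inbox2"))
        str_result = "delivered"
    except UnicodeEncodeError as exc:
        str_result = "failed: %s" % exc.reason

print(stored.decode("utf-8"))
print("str input:", str_result)
```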



Aw: Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Karsten Hilbert
> Terry Reedy  wrote:
> > On 8/26/2020 11:10 AM, Chris Green wrote:
> >
> > > I have a simple[ish] local mbox mail delivery module as follows:-
> > ...
> > > It has run faultlessly for many years under Python 2.  I've now
> > > changed the calling program to Python 3 and while it handles most
> > > E-Mail OK I have just got the following error:-
> > >
> > >  Traceback (most recent call last):
> > >File "/home/chris/.mutt/bin/filter.py", line 102, in 
> > >  mailLib.deliverMboxMsg(dest, msg, log)
> > ...
> > >File "/usr/lib/python3.8/email/generator.py", line 406, in write
> > >  self._fp.write(s.encode('ascii', 'surrogateescape'))
> > > UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in
> > position 4: ordinal not in range(128)
> >
> > '\ufeff' is the Unicode byte-order mark.  It should not be present in an
> > ascii-only 3.x string and would not normally be present in general
> > unicode except in messages like this that talk about it.  Read about it,
> > for instance, at
> > https://en.wikipedia.org/wiki/Byte_order_mark
> >
> > I would catch the error and print part or all of string s to see what is
> > going on with this particular message.  Does it have other non-ascii chars?
> >
> I can provoke the error simply by sending myself an E-Mail with
> accented characters in it.  I'm pretty sure my Linux system is set up
> correctly for UTF8 characters, I certainly seem to be able to send and
> receive these to others and I even get to see messages in other
> scripts such as arabic, chinese, etc.
>
> The code above works perfectly in Python 2 delivering messages with
> accented (and other extended) characters with no problems at all.
> Sending myself E-Mails with accented characters works OK with the code
> running under Python 2.
>
> While an E-Mail body possibly *shouldn't* have non-ASCII characters in
> it one must be able to handle them without errors.  In fact haven't
> the RFCs changed such that the message body should be 8-bit clean?
> Anyway I think the Python 3 mail handling libraries need to be able to
> pass extended characters through without errors.

Well, '\ufeff' is not a *character* at all in much of any
sense of that word in unicode.

It's a marker. Whatever puts it into the stream is wrong. I guess the
best one can (and should) do is to catch the exception and dump
the offending stream somewhere binary-capable and pass on a notice. What
you are receiving there very much isn't a (well-formed) e-mail message.

I would then attempt to backwards-crawl the delivery chain to
find out where it came from.

Or so is my current understanding.

Karsten


Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Chris Green
Peter J. Holzer  wrote:
> The problem is that the message contains a '\ufeff' character (byte
> order mark) where email/generator.py expects only ASCII characters.
> 
> I see two possible reasons for this:
> 
>  * The mbox writing code assumes that all messages with non-ascii
>characters are QP or base64 encoded, and some higher layer uses 8bit
>instead.
> 
>  * A mime-part is declared as charset=us-ascii but contains really
>Unicode characters.
> 
> Both reasons are weird.
> 
> The first would be an unreasonable assumption (8bit encoding has been
> common since the mid-1990s), but even if the code made that assumption,
> one would expect that other code from the same library honors it.
> 
> The second shouldn't be possible: If a message is mis-declared (that
> happens) one would expect that the error happens during parsing, not
> when trying to serialize the already parsed message. 
> 
> But then you haven't shown where msg comes from. How do you parse the
> message to get "msg"?
> 
> Can you construct a minimal test message which triggers the bug?
> 
Yes, simply sending myself an E-Mail with (for example) accented
characters triggers the error.

I'm pretty certain my system (and E-Mail in and out, and Usenet news)
handle these correctly as UTF8.  E.g.:-

àéçł

It's *only* when I switch the mail delivery to Python 3 that the error
appears.

-- 
Chris Green


Re: Another 2 to 3 mail encoding problem

2020-08-27 Thread Chris Green
Terry Reedy  wrote:
> On 8/26/2020 11:10 AM, Chris Green wrote:
> 
> > I have a simple[ish] local mbox mail delivery module as follows:-
> ...
> > It has run faultlessly for many years under Python 2.  I've now
> > changed the calling program to Python 3 and while it handles most
> > E-Mail OK I have just got the following error:-
> > 
> >  Traceback (most recent call last):
> >File "/home/chris/.mutt/bin/filter.py", line 102, in 
> >  mailLib.deliverMboxMsg(dest, msg, log)
> ...
> >File "/usr/lib/python3.8/email/generator.py", line 406, in write
> >  self._fp.write(s.encode('ascii', 'surrogateescape'))
> > UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in 
> position 4: ordinal not in range(128) 
> 
> '\ufeff' is the Unicode byte-order mark.  It should not be present in an 
> ascii-only 3.x string and would not normally be present in general 
> unicode except in messages like this that talk about it.  Read about it, 
> for instance, at
> https://en.wikipedia.org/wiki/Byte_order_mark
> 
> I would catch the error and print part or all of string s to see what is 
> going on with this particular message.  Does it have other non-ascii chars?
> 
I can provoke the error simply by sending myself an E-Mail with
accented characters in it.  I'm pretty sure my Linux system is set up
correctly for UTF8 characters, I certainly seem to be able to send and
receive these to others and I even get to see messages in other
scripts such as arabic, chinese, etc.

The code above works perfectly in Python 2 delivering messages with
accented (and other extended) characters with no problems at all.
Sending myself E-Mails with accented characters works OK with the code
running under Python 2.

While an E-Mail body possibly *shouldn't* have non-ASCII characters in
it one must be able to handle them without errors.  In fact haven't
the RFCs changed such that the message body should be 8-bit clean?
Anyway I think the Python 3 mail handling libraries need to be able to
pass extended characters through without errors.

-- 
Chris Green


Re: Another 2 to 3 mail encoding problem

2020-08-26 Thread Terry Reedy

On 8/26/2020 11:10 AM, Chris Green wrote:


I have a simple[ish] local mbox mail delivery module as follows:-

...

It has run faultlessly for many years under Python 2.  I've now
changed the calling program to Python 3 and while it handles most
E-Mail OK I have just got the following error:-

 Traceback (most recent call last):
   File "/home/chris/.mutt/bin/filter.py", line 102, in 
 mailLib.deliverMboxMsg(dest, msg, log)

...

   File "/usr/lib/python3.8/email/generator.py", line 406, in write
 self._fp.write(s.encode('ascii', 'surrogateescape'))
 UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in 
position 4: ordinal not in range(128)


'\ufeff' is the Unicode byte-order mark.  It should not be present in an 
ascii-only 3.x string and would not normally be present in general 
unicode except in messages like this that talk about it.  Read about it, 
for instance, at

https://en.wikipedia.org/wiki/Byte_order_mark

I would catch the error and print part or all of string s to see what is 
going on with this particular message.  Does it have other non-ascii chars?



--
Terry Jan Reedy
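
One way to follow the "catch the error and print part of the string" advice is sketched below. The function name `describe_encode_error` and the sample string are invented for illustration; the `start`, `end` and `object` attributes are standard on `UnicodeEncodeError`.

```python
def describe_encode_error(s, context=20):
    """Return a short report on where s stops being ASCII, or None if it is ASCII."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError as exc:
        lo = max(exc.start - context, 0)
        return {
            "position": exc.start,
            "offending": exc.object[exc.start:exc.end],
            "context": exc.object[lo:exc.end + context],
        }
    return None

# Hypothetical line with a byte-order mark at index 4, as in the traceback.
report = describe_encode_error("abcd\ufeffHello")
print(report["position"], ascii(report["offending"]), ascii(report["context"]))
```

Printing with `ascii()` keeps the report readable even when the terminal or log file cannot display the offending character.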



Unsubscribe (Re: Another 2 to 3 mail encoding problem)

2020-08-26 Thread Terry Reedy

On 8/26/2020 11:27 AM, Alexa Oña wrote:
> Don’t send me more emails
>
> --
> https://mail.python.org/mailman/listinfo/python-list

Unsubscribe yourself by going to the indicated url.


--
Terry Jan Reedy




Re: Another 2 to 3 mail encoding problem

2020-08-26 Thread Michael Torrie
On 8/26/20 9:27 AM, Alexa Oña wrote:
> Don’t send me more emails
>
> https://mail.python.org/mailman/listinfo/python-list
^
Please unsubscribe from the mailing list. Click on the link above.

Thank you.





Re: Another 2 to 3 mail encoding problem

2020-08-26 Thread Peter J. Holzer
On 2020-08-26 16:10:35 +0100, Chris Green wrote:
> I'm unearthing a few issues here trying to convert my mail filter and
> delivery programs from 2 to 3!  
> 
> I have a simple[ish] local mbox mail delivery module as follows:-
> 
[...]
> class mymbox(mailbox.mbox):
> def _pre_message_hook(self, f):
> """Don't write the blank line before the 'From '"""
> pass
[...]
> def deliverMboxMsg(dest, msg, log):
[...]
> mbx = mymbox(dest, factory=None)
[...]
> mbx.add(msg)
[...]
> 
> 
> It has run faultlessly for many years under Python 2.  I've now
> changed the calling program to Python 3 and while it handles most
> E-Mail OK I have just got the following error:-
> 
> Traceback (most recent call last):
>   File "/home/chris/.mutt/bin/filter.py", line 102, in 
> mailLib.deliverMboxMsg(dest, msg, log)
>   File "/home/chris/.mutt/bin/mailLib.py", line 52, in deliverMboxMsg
> mbx.add(msg)
>   File "/usr/lib/python3.8/mailbox.py", line 603, in add
> self._toc[self._next_key] = self._append_message(message)
>   File "/usr/lib/python3.8/mailbox.py", line 758, in _append_message
> offsets = self._install_message(message)
>   File "/usr/lib/python3.8/mailbox.py", line 830, in _install_message
> self._dump_message(message, self._file, self._mangle_from_)
>   File "/usr/lib/python3.8/mailbox.py", line 215, in _dump_message
> gen.flatten(message)
>   File "/usr/lib/python3.8/email/generator.py", line 116, in flatten
> self._write(msg)
>   File "/usr/lib/python3.8/email/generator.py", line 181, in _write
> self._dispatch(msg)
>   File "/usr/lib/python3.8/email/generator.py", line 214, in _dispatch
> meth(msg)
>   File "/usr/lib/python3.8/email/generator.py", line 432, in _handle_text
> super(BytesGenerator,self)._handle_text(msg)
>   File "/usr/lib/python3.8/email/generator.py", line 249, in _handle_text
> self._write_lines(payload)
>   File "/usr/lib/python3.8/email/generator.py", line 155, in _write_lines
> self.write(line)
>   File "/usr/lib/python3.8/email/generator.py", line 406, in write
> self._fp.write(s.encode('ascii', 'surrogateescape'))
> UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in 
> position 4: ordinal not in range(128)

The problem is that the message contains a '\ufeff' character (byte
order mark) where email/generator.py expects only ASCII characters.

I see two possible reasons for this:

 * The mbox writing code assumes that all messages with non-ascii
   characters are QP or base64 encoded, and some higher layer uses 8bit
   instead.

 * A mime-part is declared as charset=us-ascii but contains really
   Unicode characters.

Both reasons are weird.

The first would be an unreasonable assumption (8bit encoding has been
common since the mid-1990s), but even if the code made that assumption,
one would expect that other code from the same library honors it.

The second shouldn't be possible: If a message is mis-declared (that
happens) one would expect that the error happens during parsing, not
when trying to serialize the already parsed message. 

But then you haven't shown where msg comes from. How do you parse the
message to get "msg"?

Can you construct a minimal test message which triggers the bug?

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Another 2 to 3 mail encoding problem

2020-08-26 Thread Python

Alexa Oña wrote:
> Don’t send me more emails
>
> Get Outlook for iOS


You are the one spamming the mailing list with unrelated posts.

STOP.




Re: Another 2 to 3 mail encoding problem

2020-08-26 Thread Chris Green
To add a little to this, the problem is definitely when I receive a
message with UTF8 (or at least non-ASCII) characters in it.  My code
is basically very simple, the main program reads an E-Mail message
received from .forward on its standard input and makes it into an mbox
message as follows:-

msg = mailbox.mboxMessage(sys.stdin.read())

it then does various tests (but doesn't change msg at all) and at the
end delivers the message to my local mbox with:-

mbx.add(msg)

where mbx is an instance of mailbox.mbox.


So, how is one supposed to handle this, should I encode the incoming
message somewhere?

-- 
Chris Green


Re: Another 2 to 3 mail encoding problem

2020-08-26 Thread Alexa Oña
Don’t send me more emails

Get Outlook for iOS<https://aka.ms/o0ukef>

From: Python-list 
on behalf of Chris Green 
Sent: Wednesday, August 26, 2020 5:10:35 PM
To: python-list@python.org 
Subject: Another 2 to 3 mail encoding problem

I'm unearthing a few issues here trying to convert my mail filter and
delivery programs from 2 to 3!

I have a simple[ish] local mbox mail delivery module as follows:-


import mailbox
import logging
import logging.handlers
import os
import time
#
#
# Class derived from mailbox.mbox so we can override _pre_message_hook()
# to do nothing instead of appending a blank line
#
class mymbox(mailbox.mbox):
    def _pre_message_hook(self, f):
        """Don't write the blank line before the 'From '"""
        pass
#
#
# log a message
#
def initLog(name):
    log = logging.getLogger(name)
    log.setLevel(logging.DEBUG)
    f = logging.handlers.RotatingFileHandler("/home/chris/tmp/mail.log", 'a', 100, 4)
    f.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    f.setFormatter(formatter)
    log.addHandler(f)
    return log
#
#
# Deliver a message to a local mbox
#
def deliverMboxMsg(dest, msg, log):
    #
    #
    # Create the destination mbox instance
    #
    mbx = mymbox(dest, factory=None)

    log.info("From: " + msg.get("From", "unknown"))
    log.info("Destination is: " + dest)
    #
    #
    # Lock the mbox while we append to it
    #
    for tries in range(3):
        try:
            mbx.lock()
            #
            #
            # Append the incoming message to the appropriate mbox
            #
            mbx.add(msg)
            #
            #
            # now set the modified time later than the access time (which is 'now') so
            # that mutt will see new mail in the mbox.
            #
            os.utime(dest, ((time.time()), (time.time() + 5)))
            mbx.unlock()
            break

        except mailbox.ExternalClashError:
            log.info("Destination locked, try " + str(tries))
            time.sleep(1)

    else: # get here if we ran out of tries
        log.warning("Failed to lock destination after 3 attempts, giving up")

    return


It has run faultlessly for many years under Python 2.  I've now
changed the calling program to Python 3 and while it handles most
E-Mail OK I have just got the following error:-

Traceback (most recent call last):
  File "/home/chris/.mutt/bin/filter.py", line 102, in 
mailLib.deliverMboxMsg(dest, msg, log)
  File "/home/chris/.mutt/bin/mailLib.py", line 52, in deliverMboxMsg
mbx.add(msg)
  File "/usr/lib/python3.8/mailbox.py", line 603, in add
self._toc[self._next_key] = self._append_message(message)
  File "/usr/lib/python3.8/mailbox.py", line 758, in _append_message
offsets = self._install_message(message)
  File "/usr/lib/python3.8/mailbox.py", line 830, in _install_message
self._dump_message(message, self._file, self._mangle_from_)
  File "/usr/lib/python3.8/mailbox.py", line 215, in _dump_message
gen.flatten(message)
  File "/usr/lib/python3.8/email/generator.py", line 116, in flatten
self._write(msg)
  File "/usr/lib/python3.8/email/generator.py", line 181, in _write
self._dispatch(msg)
  File "/usr/lib/python3.8/email/generator.py", line 214, in _dispatch
meth(msg)
  File "/usr/lib/python3.8/email/generator.py", line 432, in _handle_text
super(BytesGenerator,self)._handle_text(msg)
  File "/usr/lib/python3.8/email/generator.py", line 249, in _handle_text
self._write_lines(payload)
  File "/usr/lib/python3.8/email/generator.py", line 155, in _write_lines
self.write(line)
  File "/usr/lib/python3.8/email/generator.py", line 406, in write
self._fp.write(s.encode('ascii', 'surrogateescape'))
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in 
position 4: ordinal not in range(128)

So what do I need to do to the message I'm adding with mbx.add(msg) to
fix this?  (I assume that's what I need to do).

--
Chris Green
·
--
https://mail.python.org/mailman/listinfo/python-list
-- 
https://mail.python.org/mailman/listinfo/python-list


Another 2 to 3 mail encoding problem

2020-08-26 Thread Chris Green
I'm unearthing a few issues here trying to convert my mail filter and
delivery programs from 2 to 3!  

I have a simple[ish] local mbox mail delivery module as follows:-


import mailbox
import logging
import logging.handlers
import os
import time
#
#
# Class derived from mailbox.mbox so we can override _pre_message_hook()
# to do nothing instead of appending a blank line
#
class mymbox(mailbox.mbox):
    def _pre_message_hook(self, f):
        """Don't write the blank line before the 'From '"""
        pass
#
#
# log a message
#
def initLog(name):
    log = logging.getLogger(name)
    log.setLevel(logging.DEBUG)
    f = logging.handlers.RotatingFileHandler("/home/chris/tmp/mail.log", 'a', 100, 4)
    f.setLevel(logging.DEBUG)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    f.setFormatter(formatter)
    log.addHandler(f)
    return log
#
#
# Deliver a message to a local mbox
#
def deliverMboxMsg(dest, msg, log):
    #
    #
    # Create the destination mbox instance
    #
    mbx = mymbox(dest, factory=None)

    log.info("From: " + msg.get("From", "unknown"))
    log.info("Destination is: " + dest)
    #
    #
    # Lock the mbox while we append to it
    #
    for tries in range(3):
        try:
            mbx.lock()
            #
            #
            # Append the incoming message to the appropriate mbox
            #
            mbx.add(msg)
            #
            #
            # now set the modified time later than the access time (which is 'now') so
            # that mutt will see new mail in the mbox.
            #
            os.utime(dest, ((time.time()), (time.time() + 5)))
            mbx.unlock()
            break

        except mailbox.ExternalClashError:
            log.info("Destination locked, try " + str(tries))
            time.sleep(1)

    else: # get here if we ran out of tries
        log.warn("Failed to lock destination after 3 attempts, giving up")

    return


It has run faultlessly for many years under Python 2.  I've now
changed the calling program to Python 3 and while it handles most
E-Mail OK I have just got the following error:-

Traceback (most recent call last):
  File "/home/chris/.mutt/bin/filter.py", line 102, in <module>
    mailLib.deliverMboxMsg(dest, msg, log)
  File "/home/chris/.mutt/bin/mailLib.py", line 52, in deliverMboxMsg
    mbx.add(msg)
  File "/usr/lib/python3.8/mailbox.py", line 603, in add
    self._toc[self._next_key] = self._append_message(message)
  File "/usr/lib/python3.8/mailbox.py", line 758, in _append_message
    offsets = self._install_message(message)
  File "/usr/lib/python3.8/mailbox.py", line 830, in _install_message
    self._dump_message(message, self._file, self._mangle_from_)
  File "/usr/lib/python3.8/mailbox.py", line 215, in _dump_message
    gen.flatten(message)
  File "/usr/lib/python3.8/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.8/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.8/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.8/email/generator.py", line 432, in _handle_text
    super(BytesGenerator,self)._handle_text(msg)
  File "/usr/lib/python3.8/email/generator.py", line 249, in _handle_text
    self._write_lines(payload)
  File "/usr/lib/python3.8/email/generator.py", line 155, in _write_lines
    self.write(line)
  File "/usr/lib/python3.8/email/generator.py", line 406, in write
    self._fp.write(s.encode('ascii', 'surrogateescape'))
UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 4: ordinal not in range(128)

So what do I need to do to the message I'm adding with mbx.add(msg) to
fix this?  (I assume that's what I need to do).
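
One possible direction, offered as a guess because the code that builds msg is not shown: this error can appear when the message object was parsed from already-decoded text, so the payload holds real non-ASCII characters that the generator then tries to write as ASCII. Parsing from bytes avoids that. A minimal sketch, assuming the raw message is available as bytes; the addresses, the sample body and the deliver_raw helper are made up for illustration:

```python
import email
import mailbox
import os
import tempfile

# Illustrative raw message: 8bit UTF-8 body with an accented character and a BOM.
RAW = (
    "From: sender@example.com\n"
    "To: chris@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "caf\u00e9 \ufeffbody\n"
).encode("utf-8")


def deliver_raw(raw_bytes, dest):
    """Parse raw bytes and append the message to the mbox at dest (hypothetical helper)."""
    # Parsing from bytes keeps non-ASCII bytes as surrogate escapes, which
    # the generator used by mailbox can write back out unchanged.
    msg = email.message_from_bytes(raw_bytes)
    mbx = mailbox.mbox(dest)
    try:
        mbx.add(msg)
        mbx.flush()
    finally:
        mbx.close()


tmpdir = tempfile.mkdtemp()
dest = os.path.join(tmpdir, "test.mbox")
deliver_raw(RAW, dest)
with open(dest, "rb") as fh:
    stored = fh.read()
print(len(stored), "bytes written")
```

In a filter that reads the message from standard input, the equivalent would be email.message_from_bytes(sys.stdin.buffer.read()), again assuming that is how the message arrives.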

-- 
Chris Green
·


[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2019-04-01 Thread Inada Naoki


Change by Inada Naoki :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed
versions: +Python 3.7, Python 3.8 -Python 3.4, Python 3.5, Python 3.6

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2019-04-01 Thread Inada Naoki


Inada Naoki  added the comment:


New changeset 8384670615a90418fc52c3881242b7c10d1f2b13 by Inada Naoki in branch 
'3.7':
bpo-20844: open script file with "rb" mode (GH-12616)
https://github.com/python/cpython/commit/8384670615a90418fc52c3881242b7c10d1f2b13


--




[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2019-04-01 Thread Inada Naoki


Change by Inada Naoki :


--
pull_requests: +12579




[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2019-04-01 Thread Inada Naoki


Inada Naoki  added the comment:


New changeset 10654c19b5e6efdf3c529ff9bf7bcab89bdca1c1 by Inada Naoki in branch 
'master':
bpo-20844: open script file with "rb" mode (GH-12616)
https://github.com/python/cpython/commit/10654c19b5e6efdf3c529ff9bf7bcab89bdca1c1


--
nosy: +inada.naoki




[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2019-03-29 Thread Inada Naoki


Change by Inada Naoki :


--
keywords: +patch
pull_requests: +12552
stage:  -> patch review




[issue35140] encoding problem: coding:gbk cause syntaxError

2018-11-11 Thread Emmanuel Arias


Emmanuel Arias  added the comment:

I can not reproduce this issue on my Debian9.

--
nosy: +eamanu




[issue35140] encoding problem: coding:gbk cause syntaxError

2018-11-11 Thread Steve Dower


Steve Dower  added the comment:

Yes, seems like we should be opening the file in binary mode, though I haven't 
tried it. The CRT's interpretation of text mode really isn't compatible with 
Python's own interpretation of text mode, and chaining them makes even less 
sense.

--




[issue35140] encoding problem: coding:gbk cause syntaxError

2018-11-10 Thread Ma Lin


Ma Lin  added the comment:

I debugged this; it is a duplicate of issue 20844 and issue 27797.
Eryk Sun analyzed it in detail: it is a problem in the Windows CRT.

--
versions: +Python 3.5, Python 3.6, Python 3.8




[issue35140] encoding problem: coding:gbk cause syntaxError

2018-11-02 Thread Windson Yang


Change by Windson Yang :


--
title: encoding problem: gbk -> encoding problem: coding:gbk cause syntaxError

___
Python tracker 
<https://bugs.python.org/issue35140>
___



[issue35140] encoding problem: gbk

2018-11-02 Thread Windson Yang


Windson Yang  added the comment:

It's fine @anmikf, keep practicing :D. Let's recap what happened:

Running encoding_problem_gbk.py on Windows 10 with Python 3.7.0 causes 
"SyntaxError: encoding problem: gbk". But it runs as expected if

1. The file has fewer than 15 lines.
2. coding:gbk is changed to another encoding (like utf-8).
3. coding:gbk is removed.

--




[issue35140] encoding problem: gbk

2018-11-02 Thread Tim Golden


Tim Golden  added the comment:

I'm afraid you'll have to use English in this forum so that all current and 
future readers have the best chance of understanding the situation. Thank you 
very much for making the effort this far.

If anyone on this issue knows of a Chinese-language forum where this issue 
could be explored before coming back here, please say so. Otherwise I'll ask 
around on Twitter etc. to see what's available.

--




[issue35140] encoding problem: gbk

2018-11-02 Thread 安迷

安迷  added the comment:

I'm sorry for my english.
Can I use Chinese?

--




[issue35140] encoding problem: gbk

2018-11-02 Thread 安迷

安迷  added the comment:

This problem does not exist on macOS.
This problem does not exist in Python 2.

Windows10x64   Python 3.7.0 (v3.7.0:1bf9cc5093

A script with 15 blank lines has no problem.
A script with '#coding:gbk' as the first line and 14 blank lines has the problem.

--




[issue35140] encoding problem: gbk

2018-11-02 Thread Ma Lin


Ma Lin  added the comment:

Yes, I can reproduce it on my Windows 10 (Simplified Chinese).
The file is a pure ASCII file, and doesn't have a BOM prefix.

--




[issue35140] encoding problem: gbk

2018-11-02 Thread Windson Yang


Windson Yang  added the comment:

Thank you, Lin. Can you reproduce it on your machine? I guess it is related to 
the terminal encoding or the text file ending. However, I can't reproduce it on macOS.

--




[issue35140] encoding problem: gbk

2018-11-02 Thread Ma Lin


Ma Lin  added the comment:

Let me give an explanation.
Run encoding_problem_gbk.py, get an error:

D:\>encoding_problem_gbk.py
  File "D:\encoding_problem_gbk.py", line 1
SyntaxError: encoding problem: gbk

If the coding comment line is removed, the script runs as expected.

--
nosy: +Ma Lin




[issue35140] encoding problem: gbk

2018-11-02 Thread Windson Yang


Windson Yang  added the comment:

If I understand your question correctly, you should save the file (the one 
containing Chinese characters) with GBK encoding using your editor. Otherwise, 
your editor will save it using its default encoding, which leaves Python unable 
to decode it correctly.

--
nosy: +Windson Yang




[issue35141] encoding problem: gbk

2018-11-01 Thread Karthikeyan Singaravelan


Karthikeyan Singaravelan  added the comment:

@anmikf Please use one issue for all the details. I am closing this as a 
duplicate of issue35140 since it has the same reproducer script and details. 
Feel free to reopen this if it's a different one, adding a little more context 
on the difference.

Thanks!

--
nosy: +xtreak
resolution:  -> duplicate
stage:  -> resolved
status: open -> closed
superseder:  -> encoding problem: gbk

___
Python tracker 
<https://bugs.python.org/issue35141>
___



[issue35141] encoding problem: gbk

2018-11-01 Thread 安迷

New submission from 安迷 :

Windows10x64
Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit 
(AMD64)] on win32

--
components: Windows
files: encoding_problem_gbk.py
messages: 329099
nosy: anmikf, paul.moore, steve.dower, tim.golden, zach.ware
priority: normal
severity: normal
status: open
title: encoding problem: gbk
type: behavior
versions: Python 3.7
Added file: https://bugs.python.org/file47902/encoding_problem_gbk.py




[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2017-08-08 Thread Mark Lawrence

Changes by Mark Lawrence :


--
nosy:  -BreamoreBoy




[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2017-08-08 Thread Steven Winfield

Steven Winfield added the comment:

I've just been bitten by this on 3.6.2, Windows Server 2008 R2, when running 
the setup.py script for QuantLib-SWIG:
https://github.com/lballabio/QuantLib-SWIG/blob/v1.10.x/Python/setup.py

It seems there is different behaviour depending on whether:
  * Unix (LF) or Windows (CRLF) line endings are used
  * The file is >4096 bytes or <=4096 bytes
  * The module docstring has an initial space

Some of that has been mentioned previously, but I think the 4096-byte limit 
might be new, which is why I'm posting.

I've attached a script I used to come up with the results below. It contains:
  * a -*- coding line (for iso-8859-1 in this case)
  * a docstring consisting entirely of lines of x's, of length 78
  * Unix line endings

The file's length is exactly 4096 bytes.
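
For readers without access to the attachment, a file of this shape can be generated with a few lines of Python. This is only a sketch of the description above, not the attached issue20844.py: the exact coding line and the way the final partial line is filled are assumptions, and build_script is a made-up name.

```python
CODING_LINE = b"# -*- coding: iso-8859-1 -*-\n"


def build_script(total_size=4096, line_len=78):
    """Return script bytes of exactly total_size: a coding line followed by
    a docstring made of lines of x's, using Unix (LF) line endings."""
    opener, closer = b'"""', b'"""\n'
    room = total_size - len(CODING_LINE) - len(opener) - len(closer)
    if room < 0:
        raise ValueError("total_size is too small for the fixed parts")
    # Full lines of x's plus LF, then a shorter run of x's to hit the size.
    full_lines, rest = divmod(room, line_len + 1)
    body = (b"x" * line_len + b"\n") * full_lines + b"x" * rest
    return CODING_LINE + opener + body + closer


script = build_script()
print(len(script))
```

Writing the result with open(path, "wb") keeps the LF endings intact; build_script(4097) gives the "> 4096" variant. Whether either file triggers the error depends on running an affected interpreter on Windows.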

Running this, or slightly modified versions of this, with a 3.6.2 interpreter 
gave the following results:

  * In all cases, when Windows line endings were used there was no issue: 
    running the script produced no errors or output.

  * With Unix line endings:

    * File length <= 4096, with no leading spaces in the docstring:

        File "issue20844.py", line 1
      SyntaxError: encoding problem: iso-8859-1

    * File length > 4096, with no leading spaces in the docstring:

        File "issue20844.py", line 56
          x"""
             ^
      SyntaxError: EOF while scanning triple-quoted string literal

    * Any file length, with the first 'x' on line 3 replaced with a space
      (line 2 if the coding line is ignored):

        File "issue20844.py", line 2
          x
          ^
      IndentationError: unexpected indent

I had no issues with python 2.7.13.

--
nosy: +steven.winfield
versions: +Python 3.6
Added file: http://bugs.python.org/file47065/issue20844.py

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20844>
___



[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2014-12-27 Thread Ned Batchelder

Ned Batchelder added the comment:

This bug just bit me.  Changing # coding: utf8 to # coding: utf-8 works 
around it.

--
nosy: +nedbat




[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2014-12-27 Thread Ned Batchelder

Ned Batchelder added the comment:

(oops: with Python 3.4.1 on Windows)

--




[issue20844] SyntaxError: encoding problem: iso-8859-1 on Windows

2014-07-30 Thread Mark Lawrence

Mark Lawrence added the comment:

I've tried to make the title more meaningful, feel free to change it if you can 
think of something better.

--
components: +Interpreter Core
nosy: +tim.golden, zach.ware
title: coding bug remains in 3.3.5rc2 -> SyntaxError: encoding problem: iso-8859-1 on Windows
type:  -> behavior
versions: +Python 3.5 -Python 3.3




Re: Encoding problem in python

2013-08-21 Thread electron
If you use Arabic frequently on your system, I suggest changing your
Windows system locale to Arabic under Region and Language in the Control
Panel (Administrative tab).



Encoding problem in python

2013-03-04 Thread yomnasalah91
I have a problem with encoding in python 27 shell.

when i write this in the python shell:

w=u'العربى'

It gives me the following error:

Unsupported characters in input

any help?


Re: Encoding problem in python

2013-03-04 Thread Laszlo Nagy

On 2013-03-04 10:37, yomnasala...@gmail.com wrote:

> I have a problem with encoding in python 27 shell.
>
> when i write this in the python shell:
>
> w=u'العربى'
>
> It gives me the following error:
>
> Unsupported characters in input
>
> any help?
Maybe it is not Python related. Did you get an exception? Can you send a 
full traceback? I suspect that the error comes from your terminal, and 
not Python. Please make sure that your terminal supports UTF-8 encoding. 
Alternatively, try creating a file with this content:



# -*- encoding: UTF-8 -*-
w=u'العربى'

Save it as a UTF-8 encoded file test.py (with a UTF-8 compatible 
editor, for example Geany) and run it as a command:



python test.py

If it works then it is sure that the problem is with your terminal. It 
will be an OS limitation, not Python's limitation.


Best,

   Laszlo


Re: Encoding problem in python

2013-03-04 Thread Steven D'Aprano
On Mon, 04 Mar 2013 01:37:42 -0800, yomnasalah91 wrote:

> I have a problem with encoding in python 27 shell.
> 
> when i write this in the python shell:
> 
> w=u'العربى'
> 
> It gives me the following error:
> 
> Unsupported characters in input
> 
> any help?

Firstly, please show the COMPLETE error, including the full traceback. 
Python errors look like (for example):

py> x = ord(100)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ord() expected string of length 1, but int found


Copy and paste the complete traceback.


Secondly, please describe your environment:

- What operating system and version are you using? Linux, Windows, Mac 
OS, something else? Which version or distro?

- Which console or terminal application? E.g. cmd.exe (Windows), konsole, 
xterm, something else?

- Which shell? E.g. the standard Python interpreter, IDLE, bpython, 
something else?


My guess is that this is not a Python problem, but an issue with your 
console. You should always have your console set to use UTF-8, if 
possible. I expect that your console is set to use a different encoding. 
In that case, see if you can change it to UTF-8. For example, using Gnome 
Terminal on Linux, I can do this:


py> w = u'العربى'
py> print w
العربى

and it works fine, but if I change the encoding to WINDOWS-1252 using the 
"Set character encoding" menu command, the terminal will not allow me to 
paste the string into the terminal. 
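
A quick way to see which encodings Python thinks the console uses, and whether the Arabic sample is representable in them, is sketched below. It uses Python 3 syntax; console_encodings and can_encode are made-up helper names, not part of any library.

```python
import locale
import sys


def console_encodings():
    """Collect the encodings Python believes the console and locale use."""
    return {
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
    }


def can_encode(text, encoding):
    """Return True if text can be represented in the named encoding."""
    if not encoding:
        return False
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        return False
    return True


sample = "\u0627\u0644\u0639\u0631\u0628\u0649"  # the Arabic word from the question
for name, enc in sorted(console_encodings().items()):
    print(name, enc, can_encode(sample, enc))
```

If any of the reported encodings cannot represent the sample, the console setup is the more likely culprit than Python itself.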



-- 
Steven


Re: Encoding problem in python

2013-03-04 Thread Vlastimil Brom
2013/3/4  yomnasala...@gmail.com:
> I have a problem with encoding in python 27 shell.
>
> when i write this in the python shell:
>
> w=u'العربى'
>
> It gives me the following error:
>
> Unsupported characters in input
>
> any help?
> --
> http://mail.python.org/mailman/listinfo/python-list


Hi,
I guess you are using the built-in IDLE shell with Python 2.7, and
this is a specific limitation of its handling of some Unicode
characters (in some builds and OSes: narrow-unicode, Windows, most
likely?) with its own specific error message (not the usual Python
traceback mentioned in other posts).
If it is viable, using Python 3.3 instead would solve this problem for IDLE:

Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> w='العربى'
>>> w
'العربى'

(Note the missing u in the unicode literal before the starting quotation
mark, which would be the usual usage in Python 3, but Python 3.3 also
silently ignores u'...' for compatibility.)

>>> w=u'العربى'
>>> w
'العربى'


If python 2.7 is required, another shell is probably needed (unless I
am missing some option to make IDLE work for this input);
e.g. the following works in pyshell - part of the wxpython GUI library
http://www.wxpython.org/

>>> w=u'العربى'
>>> w
u'\u0627\u0644\u0639\u0631\u0628\u0649'
>>> print w
العربى


hth,
   vbr


[issue13395] Python ISO-8859-1 encoding problem

2011-11-13 Thread Hugo Silva

New submission from Hugo Silva hugo...@gmail.com:

Hi all,

I'm facing a huge encoding problem in Python when dealing with ISO-8859-1 / 
Latin-1 character set.

When using os.listdir to get the contents of a folder I'm getting the strings 
encoded in ISO-8859-1 (ex: ''Ol\xe1 Mundo''), however in the Python interpreter 
the same string is encoded to a different charset:

In : 'Olá Mundo'.decode('latin-1')
Out: u'Ol\xa0 Mundo'

How can I force Python to decode the string to the same format. I've seen that 
os.listdir is returning the strings correctly encoded but the interpreter is 
not ('á' character corresponds to '\xe1' in ISO-8859-1, not to '\xa0'):

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

This is happening 

Any thoughts on how to overcome ?

Regards,

--
components: Unicode
messages: 147552
nosy: Hugo.Silva, ezio.melotti
priority: normal
severity: normal
status: open
title: Python ISO-8859-1 encoding problem
versions: Python 2.7

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13395
___



[issue13395] Python ISO-8859-1 encoding problem

2011-11-13 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

This doesn't seem a bug to me, so you should ask for help somewhere else.
You can try to pass a unicode arg to listdir to get unicode back, and double 
check what the input actually is.

--
resolution:  -> invalid
stage:  -> committed/rejected
status: open -> closed




[issue13395] Python ISO-8859-1 encoding problem

2011-11-13 Thread Martin v . Löwis

Martin v. Löwis mar...@v.loewis.de added the comment:

Apparently, you are using the interactive shell on Microsoft Windows. This will 
use the OEM code page; which one that is depends on the exact Windows 
regional version you are using.

You shouldn't decode the string with 'latin-1', but with sys.stdin.encoding. 
Alternatively, you should use Unicode string literals in Python in the first 
place.
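
The mismatch can be reproduced without a Windows console. The sketch below uses Python 3 syntax and assumes the OEM code page is cp850 (a common Western European one); the real code page on the reporter's machine is not stated in the report.

```python
# What the console hands to Python when 'Olá Mundo' is typed under cp850.
typed = "Ol\u00e1 Mundo".encode("cp850")

# Decoding those bytes as Latin-1 gives the wrong character (U+00A0).
wrong = typed.decode("latin-1")

# Decoding with the console's own encoding recovers the text.
right = typed.decode("cp850")

# A file name stored as ISO-8859-1 uses a different byte for the same letter.
latin1_name = "Ol\u00e1 Mundo".encode("latin-1")

print(repr(typed), repr(wrong), repr(right), repr(latin1_name))
```

In an interactive session the encoding to use in place of the hard-coded "cp850" would be sys.stdin.encoding, as suggested above.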

In any case, Ezio is right: this is not a help forum, but a bug tracker.

--
nosy: +loewis




Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-08 Thread Nobody
On Wed, 05 Oct 2011 21:39:17 -0700, Greg wrote:

> Here is the final code for those who are struggling with similar
> problems:
> 
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta tag
> # e.g. <meta charset="iso-8859-2">
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)

The fileObj.decode() step should be unnecessary, and is usually
undesirable; Beautiful Soup should be doing the decoding itself.

If you actually know the encoding (e.g. from a Content-Type header), you
can specify it via the fromEncoding parameter to the BeautifulSoup
constructor, e.g.:

fileSoup = BeautifulSoup(fileObj.read(), fromEncoding="iso-8859-2")

If you don't specify the encoding, it will be deduced from a meta tag if
one is present, or a Unicode BOM, or using the chardet library if
available, or using built-in heuristics, before finally falling back to
Windows-1252 (which seems to be the preferred encoding of people who don't
understand what an encoding is or why it needs to be specified).



Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread Ulrich Eckhardt

On 06.10.2011 05:40, Steven D'Aprano wrote:

> (4) Do all your processing in Unicode, not bytes.
>
> (5) Encode the text into bytes using UTF-8 encoding.
>
> (6) Write the bytes to a file.


Just wondering, why do you split the latter two parts? I would have used 
codecs.open() to open the file and define the encoding in a single step. 
Is there a downside to this approach?


Otherwise, I can only confirm that your overall approach is the easiest 
way to get correct results.


Uli


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread Chris Angelico
On Thu, Oct 6, 2011 at 8:29 PM, Ulrich Eckhardt
ulrich.eckha...@dominalaser.com wrote:
> Just wondering, why do you split the latter two parts? I would have used
> codecs.open() to open the file and define the encoding in a single step. Is
> there a downside to this approach?


Those two steps still happen, even if you achieve them in a single
function call. What Steven described is language- and library-
independent.
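
To make that concrete, here is a small sketch in Python 3 syntax showing that the explicit encode-then-write and the single-call form put the same bytes on disk. It uses the built-in open() with an encoding argument as a stand-in for codecs.open(); the file names and sample text are made up.

```python
import os
import tempfile

text = "Branie zak\u0142adnik\u00f3w\n"
tmpdir = tempfile.mkdtemp()
two_step = os.path.join(tmpdir, "two_step.txt")
one_call = os.path.join(tmpdir, "one_call.txt")

# Steps (5) and (6) spelled out: encode to bytes, then write the bytes.
data = text.encode("utf-8")
with open(two_step, "wb") as f:
    f.write(data)

# The same two steps hidden behind one call; newline="" turns off newline
# translation so the comparison is byte for byte on every platform.
with open(one_call, "w", encoding="utf-8", newline="") as f:
    f.write(text)

with open(two_step, "rb") as a, open(one_call, "rb") as b:
    same = a.read() == b.read()
print(same)
```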

ChrisA


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread jmfauth
On 6 oct, 06:39, Greg gregor.hochsch...@googlemail.com wrote:
> Brilliant! It worked. Thanks!
>
> Here is the final code for those who are struggling with similar
> problems:
>
> ## open and decode file
> # In this case, the encoding comes from the charset argument in a meta tag
> # e.g. <meta charset="iso-8859-2">
> fileObj = open(filePath,"r").read()
> fileContent = fileObj.decode("iso-8859-2")
> fileSoup = BeautifulSoup(fileContent)
>
> ## Do some BeautifulSoup magic and preserve unicode, presume result is
> saved in 'text' ##
>
> ## write extracted text to file
> f = open(outFilePath, 'w')
> f.write(text.encode('utf-8'))
> f.close()




or  (Python2/Python3)

>>> import io
>>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
...     r = f.read()
...
>>> repr(r)
u'a\nb\nc\n'
>>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
...     t = f.write(r)
...
>>> f.closed
True

jmf



Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread xDog Walker
On Thursday 2011 October 06 10:41, jmfauth wrote:
> or  (Python2/Python3)
>
> >>> import io
> >>> with io.open('abc.txt', 'r', encoding='iso-8859-2') as f:
> ...     r = f.read()
> ...
> >>> repr(r)
> u'a\nb\nc\n'
> >>> with io.open('def.txt', 'w', encoding='utf-8-sig') as f:
> ...     t = f.write(r)
> ...
> >>> f.closed
> True
>
> jmf

What is this "io" of which you speak?

-- 
I have seen the future and I am not in it.



Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-06 Thread John Gordon
In mailman.1785.1317928997.27778.python-l...@python.org xDog Walker 
thud...@gmail.com writes:

> What is this "io" of which you speak?

It was introduced in Python 2.6.

-- 
John Gordon   A is for Amy, who fell down the stairs
gor...@panix.com  B is for Basil, assaulted by bears
-- Edward Gorey, The Gashlycrumb Tinies



encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-05 Thread Greg
Hi, I am having some encoding problems when I first parse stuff from a
non-english website using BeautifulSoup and then write the results to
a txt file.

I have the text both as a normal (text) and as a unicode string
(utext):
print repr(text)
'Branie zak\xc2\xb3adnik\xc3\xb3w'

print repr(utext)
u'Branie zak\xb3adnik\xf3w'

print text or print utext (fileSoup.prettify() also shows 'wrong'
symbols):
Branie zak³adników


Now I am trying to save this to a file but I never get the encoding
right. Here is what I tried (+ lots of different things with encode,
decode...):
outFile=open(filePath,"w")
outFile.write(text)
outFile.close()

outFile=codecs.open( filePath, "w", "UTF8" )
outFile.write(utext)
outFile.close()

Thanks!!







Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-05 Thread Steven D'Aprano
On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:

> Hi, I am having some encoding problems when I first parse stuff from a
> non-english website using BeautifulSoup and then write the results to a
> txt file.

If you haven't already read this, you should do so:

http://www.joelonsoftware.com/articles/Unicode.html



> I have the text both as a normal (text) and as a unicode string (utext):
> print repr(text)
> 'Branie zak\xc2\xb3adnik\xc3\xb3w'

This is pretty much meaningless, because we don't know how you got the 
text and what it actually is. You're showing us a bunch of bytes, with no 
clue as to whether they are the right bytes or not. Considering that your 
Unicode text is also incorrect, I would say it is *not* right and your 
description of the problem is 100% backwards: the problem is not 
*writing* the text, but *reading* the bytes and decoding it.
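
A minimal sketch of how output like the one quoted above can arise, written in Python 3 syntax (the thread itself uses Python 2). It assumes the page is really encoded in ISO-8859-2; the byte string `raw` is a hypothetical stand-in for the page content, not data from the original poster.

```python
# Bytes as they might arrive from an ISO-8859-2 page (hypothetical sample).
raw = b"Branie zak\xb3adnik\xf3w"

# Decoding with the wrong codec (Latin-1) silently yields wrong characters.
wrong = raw.decode("iso-8859-1")
# Decoding with the right codec yields the intended Polish text.
right = raw.decode("iso-8859-2")

# ascii() keeps the output printable on any terminal.
print(ascii(wrong))
print(ascii(wrong.encode("utf-8")))  # UTF-8 bytes of the wrong characters
print(ascii(right))
```

No error is raised at any point, which is why the mistake only shows up later, when the text is displayed or written out.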


You should do something like this:

(1) Inspect the web page to find out what encoding is actually used.

(2) If the web page doesn't know what encoding it uses, or if it uses 
bits and pieces of different encodings, then the source is broken and you 
shouldn't expect much better results. You could try guessing, but you 
should expect mojibake in your results.

http://en.wikipedia.org/wiki/Mojibake

(3) Decode the web page into Unicode text, using the correct encoding.

(4) Do all your processing in Unicode, not bytes.

(5) Encode the text into bytes using UTF-8 encoding.

(6) Write the bytes to a file.
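
Steps (3) to (6) can be sketched as follows, in Python 3 syntax. The function name `transcode`, the sample bytes, and the `strip()` call standing in for "processing" are illustrative assumptions only; the source encoding must come from inspecting the page, as in steps (1) and (2).

```python
import os
import tempfile

def transcode(raw_bytes, source_encoding, out_path):
    """Decode bytes to text, process as text, write back as UTF-8."""
    text = raw_bytes.decode(source_encoding)   # step (3)
    text = text.strip()                        # step (4): placeholder processing
    with open(out_path, "wb") as out_file:     # steps (5) and (6)
        out_file.write(text.encode("utf-8"))
    return text

raw = b"  Branie zak\xb3adnik\xf3w\n"          # hypothetical ISO-8859-2 input
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")
text = transcode(raw, "iso-8859-2", out_path)
```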


[...]
 Now I am trying to save this to a file but I never get the encoding
 right. Here is what I tried (+ lots of different things with encode,
 decode...):

 outFile=codecs.open( filePath, "w", "UTF8" )
 outFile.write(utext)
 outFile.close()

That's the correct approach, but it won't help you if utext contains the 
wrong characters in the first place. The critical step is taking the 
bytes in the web page and turning them into text.

How are you generating utext?



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-05 Thread Greg
Brilliant! It worked. Thanks!

Here is the final code for those who are struggling with similar
problems:

## open and decode file
# In this case, the encoding comes from the charset argument in a meta tag
# e.g. <meta charset="iso-8859-2">
fileObj = open(filePath,"r").read()
fileContent = fileObj.decode("iso-8859-2")
fileSoup = BeautifulSoup(fileContent)

## Do some BeautifulSoup magic and preserve unicode, presume result is saved in 'text' ##

## write extracted text to file
f = open(outFilePath, 'w')
f.write(text.encode('utf-8'))
f.close()



On Oct 5, 11:40 pm, Steven D'Aprano steve
+comp.lang.pyt...@pearwood.info wrote:
 On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:
  Hi, I am having some encoding problems when I first parse stuff from a
  non-english website using BeautifulSoup and then write the results to a
  txt file.

 If you haven't already read this, you should do so:

 http://www.joelonsoftware.com/articles/Unicode.html

  I have the text both as a normal (text) and as a unicode string (utext):
  print repr(text)
  'Branie zak\xc2\xb3adnik\xc3\xb3w'

 This is pretty much meaningless, because we don't know how you got the
 text and what it actually is. You're showing us a bunch of bytes, with no
 clue as to whether they are the right bytes or not. Considering that your
 Unicode text is also incorrect, I would say it is *not* right and your
 description of the problem is 100% backwards: the problem is not
 *writing* the text, but *reading* the bytes and decoding it.

 You should do something like this:

 (1) Inspect the web page to find out what encoding is actually used.

 (2) If the web page doesn't know what encoding it uses, or if it uses
 bits and pieces of different encodings, then the source is broken and you
 shouldn't expect much better results. You could try guessing, but you
 should expect mojibake in your results.

 http://en.wikipedia.org/wiki/Mojibake

 (3) Decode the web page into Unicode text, using the correct encoding.

 (4) Do all your processing in Unicode, not bytes.

 (5) Encode the text into bytes using UTF-8 encoding.

 (6) Write the bytes to a file.

 [...]

  Now I am trying to save this to a file but I never get the encoding
  right. Here is what I tried (+ lots of different things with encode,
  decode...):
  outFile=codecs.open( filePath, "w", "UTF8" )
  outFile.write(utext)
  outFile.close()

 That's the correct approach, but it won't help you if utext contains the
 wrong characters in the first place. The critical step is taking the
 bytes in the web page and turning them into text.

 How are you generating utext?

 --
 Steven

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: encoding problem with BeautifulSoup - problem when writing parsed text to file

2011-10-05 Thread Chris Angelico
On Thu, Oct 6, 2011 at 3:39 PM, Greg gregor.hochsch...@googlemail.com wrote:
 Brilliant! It worked. Thanks!

 Here is the final code for those who are struggling with similar
 problems:

 ## open and decode file
 # In this case, the encoding comes from the charset argument in a meta tag
 # e.g. <meta charset="iso-8859-2">
 fileContent = fileObj.decode("iso-8859-2")
 f.write(text.encode('utf-8'))

In other words, when you decode correctly into Unicode and encode
correctly onto the disk, it works!

This is why encodings are so important :)

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-13 Thread Sérgio Monteiro Basto
Ian Kelly wrote:

 If you want your output to behave that way, then all you have to do is
 specify that with an explicit encode step.

ok 

 If we want we change default for whatever we want, but without this
 default change Python should not change his behavior depending on
 output. yeah I prefer strange output for a different platform, to a
 decode errors.
 
 Sorry, I disagree.  If your program is going to fail, it's better that
 it fail noisily (with an error) than silently (with no notice that
 anything is wrong).

Hi,
OK, a short summary: I got the solution, which is setting the environment
variable PYTHONIOENCODING=utf-8. If that were the default on modern GNU/Linux,
it would have saved me lots of time.
My practical problem is simple: I write a script that I run in a shell
for testing, and that logs to a file when used with a configuration.
Everything runs well in a shell and sometimes (later) fails when logging to a
file, with a UnicodeEncodeError: 'ascii' codec can't encode character
u'\xe7' in position.
So to work in both cases (tty and files), I filled all my code with string
.encode('utf-8') calls as a workaround, when what I always wanted was to use
PYTHONIOENCODING=utf-8. Everything I have is in utf-8: the database is in
utf-8, I code in utf-8, my OS is in utf-8. In the last 3 years or so of
learning Python I lost many, many hours trying to understand this problem.
And see, I can send ascii and utf-8 to utf-8 output and never have problems,
but if I send ascii and utf-8 to ascii files I sometimes get encode errors.
So please consider, at least on Linux, defaulting the encoding to utf-8
(because we would have fewer problems), or making it clearer that piping to a
file is different from a tty and that the problem is in files, which default
to ascii. Or make the default of PYTHONIOENCODING based on the LANG
environment variable.

Anyway, many thanks for your time and for helping me out.
I don't know how things work in Python 3; are the defaults utf-8 in
Python 3?

Thanks, 
--
Sérgio M. B. 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-13 Thread Chris Angelico
2011/6/14 Sérgio Monteiro Basto sergi...@sapo.pt:
 And see, I can send ascii and utf-8 to utf-8 output and never have problems,
 but if I send ascii and utf-8 to ascii files sometimes got encode errors.


If something fits inside 7-bit ASCII, it is by definition valid UTF-8.
This is not a coincidence.
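
A quick check of this property, as a Python 3 sketch (the sample strings are illustrative only):

```python
ascii_text = "mocambique"
accented_text = "mo\u00e7ambique"  # c-cedilla written as an escape

# Pure-ASCII text produces identical bytes under both codecs.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A non-ASCII character needs more than one byte in UTF-8 ...
cedilla_bytes = accented_text.encode("utf-8")

# ... and cannot be represented in ASCII at all.
try:
    accented_text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(same_bytes, cedilla_bytes, ascii_failed)
```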

Those hours you've spent grokking this are not wasted, if you now have
a comprehension of characters vs encodings. More people in the world
need to understand that difference! :)

Chris Angelico
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-10 Thread Laurent Claessens

Le 09/06/2011 04:18, Sérgio Monteiro Basto a écrit :
 hi,
 cat test.py
 #!/usr/bin/env python
 #-*- coding: utf-8 -*-
 u = u'moçambique'
 print u.encode("utf-8")
 print u

 chmod +x test.py
 ./test.py
 moçambique
 moçambique


The following tries to encode before printing. If you pass an object that
is already UTF-8 encoded, it just prints it; if not, it encodes it. All the
print statements go through MyPrint.write.


#!/usr/bin/env python
#-*- coding: utf-8 -*-

import sys

class MyPrint(object):
    def __init__(self):
        # Keep the real stdout and install this object in its place.
        self.old_stdout=sys.stdout
        sys.stdout=self
    def write(self,text):
        try:
            encoded=text.encode("utf8")
        except UnicodeDecodeError:
            # text was already a non-ASCII byte string: pass it through.
            encoded=text
        self.old_stdout.write(encoded)


MyPrint()

u = u'moçambique'
print u.encode("utf-8")
print u

TEST :

$ ./test.py
moçambique
moçambique

$ ./test.py > test.txt
$ cat test.txt
moçambique
moçambique
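
For comparison, a rough Python 3 counterpart of this idea, offered only as a sketch: in Python 3, `io.TextIOWrapper` performs the encoding step itself. The example wraps an in-memory `io.BytesIO` instead of the real `sys.stdout.buffer` so that the result can be inspected; the helper name `make_utf8_writer` is hypothetical.

```python
import io

def make_utf8_writer(binary_stream):
    """Return a text stream that always encodes to UTF-8."""
    # newline="\n" disables platform-specific newline translation.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")

sink = io.BytesIO()               # stand-in for sys.stdout.buffer
writer = make_utf8_writer(sink)
print("mo\u00e7ambique", file=writer)
writer.flush()
written = sink.getvalue()
```

In a real program one would pass `sys.stdout.buffer`; since Python 3.7, `sys.stdout.reconfigure(encoding="utf-8")` is an alternative.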


By the way, my code will not help for error messages. I think that
errors are printed by sys.stderr.write. So if you want to do

raise "moçambique"

you should think about adding stderr to the class MyPrint.


If you know French, I strongly recommend "Comprendre les erreurs
unicode" by Victor Stinner:

http://dl.afpy.org/pycon-fr-09/Comprendre_les_erreurs_unicode.pdf

Have a nice day
Laurent
--
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-10 Thread Sérgio Monteiro Basto
Ben Finney wrote:

  What should it decode to, then?

 UTF-8, as in tty
 
 But when you explicitly redirect to a file, it's not going to a TTY.
 It's going to a file whose encoding isn't known unless you specify it.

OK, after thinking about this: the problem exists because Python wants to be
smart with ttys, which from my point of view is wrong. It should not encode to
utf-8 just because the tty is in utf-8. Python should always encode to the
same thing. If the default is ascii, it should always encode to ascii.
Yes, it should send ascii to the tty; if I send my code to a guy on Windows
who uses a tty with cp1000whatever, it shouldn't give decoding errors and
should send ascii.
If we want, we can change the default to whatever we want, but without this
default change Python should not change its behavior depending on the output.
Yes, I prefer strange output on a different platform to decode errors.
And I have /usr/bin/iconv.
 
Thanks for attention, sorry about my very limited English.
--
Sérgio M. B.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-10 Thread Ian Kelly
2011/6/10 Sérgio Monteiro Basto sergi...@sapo.pt:
 ok after thinking about this, this problem exist because Python want be
 smart with ttys, which is in my point of view is wrong, should not encode to
 utf-8, because tty is in utf-8. Python should always encode to the same
 thing. If the default is ascii, should always encode to ascii.
 yeah should send to tty in ascii, if I send my code to a guy in windows
 which use tty with cp1000whatever , shouldn't give decoding errors and
 should send in ascii .

You can't have your cake and eat it too.  If Python needs to output a
string in ascii, and that string can't be represented in ascii, then
raising an exception is the only reasonable thing to do.  You seem to
be suggesting that Python should do an implicit output.encode('ascii',
'replace') on all Unicode output, which might be okay for a TTY, but
you wouldn't want that for file output; it would allow Python to
silently create garbage data.
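
The difference between the two behaviours can be sketched in Python 3 syntax as follows (the sample string is the one used elsewhere in this thread):

```python
text = "mo\u00e7ambique"

# Strict encoding (the default) fails loudly on unrepresentable characters.
try:
    text.encode("ascii")
    strict_result = "encoded"
except UnicodeEncodeError as exc:
    strict_result = "error at position %d" % exc.start

# The 'replace' handler never fails, but the original character is lost.
lossy = text.encode("ascii", "replace")

print(strict_result)
print(lossy)
```

Once the `?` has been written to a file, nothing in the data says which character used to be there.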

And what if you send your code to somebody with a UTF-16 terminal?
You try to output ASCII to that, and you're just going to get complete
garbage.

If you want your output to behave that way, then all you have to do is
specify that with an explicit encode step.

 If we want we change default for whatever we want, but without this default
 change Python should not change his behavior depending on output.
 yeah I prefer strange output for a different platform, to a decode errors.

Sorry, I disagree.  If your program is going to fail, it's better that
it fail noisily (with an error) than silently (with no notice that
anything is wrong).
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-10 Thread Chris Angelico
2011/6/11 Sérgio Monteiro Basto sergi...@sapo.pt:
 ok after thinking about this, this problem exist because Python want be
 smart with ttys

The *anomaly* (not problem) exists because Python has a way of being
told a target encoding. If two parties agree on an encoding, they can
send characters to each other. I had this discussion at work a while
ago; my boss was talking about being binary-safe (which really meant
8-bit safe), while I was saying that we should support, verify, and
demand properly-formed UTF-8. The main significance is that agreeing
on an encoding means we can change the encoding any time it's
convenient, without having to document that we've changed the data -
because we haven't. I can take the number twelve thousand three
hundred and forty-five and render that as a string of decimal digits
as "12345", or as hexadecimal digits as "3039", but I haven't changed
the number. If you know that I'm giving you a string of decimal
digits, and I give you "12345", you will get the same number at the
far side.
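
The analogy can be checked directly. This is an illustrative Python 3 sketch; the choice of encodings in the loop is arbitrary.

```python
number = 12345

decimal_digits = str(number)        # rendered in base 10
hex_digits = format(number, "x")    # rendered in base 16

# Parsing with the agreed base recovers the same number either way.
assert int(decimal_digits, 10) == int(hex_digits, 16) == number

# The same holds for text: different agreed encodings, same characters.
text = "mo\u00e7ambique"
for encoding in ("utf-8", "utf-16", "latin-1"):
    assert text.encode(encoding).decode(encoding) == text

# Disagreeing about the encoding is what produces rubbish.
misread = text.encode("utf-8").decode("latin-1")
print(ascii(misread))
```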

Python has agreed with stdout that it will send it characters encoded
in UTF-8. Having made that agreement, Python and stdout can happily
communicate in characters, not bytes. You don't need to explicitly
encode your characters into bytes - and in fact, this would be a very
bad thing to do, because you don't know _what_ encoding stdout is
using. If it's expecting UTF-16, you'll get a whole lot of rubbish if
you send it UTF-8 - but it'll look fine if you send it Unicode.

Chris Angelico
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Benjamin Kaplan wrote:

 2011/6/8 Sérgio Monteiro Basto sergi...@sapo.pt:
 hi,
 cat test.py
 #!/usr/bin/env python
 #-*- coding: utf-8 -*-
 u = u'moçambique'
 print u.encode("utf-8")
 print u

 chmod +x test.py
 ./test.py
 moçambique
 moçambique

 ./test.py > output.txt
 Traceback (most recent call last):
   File "./test.py", line 5, in <module>
     print u
 UnicodeEncodeError: 'ascii' codec can't encode character
 u'\xe7' in position 2: ordinal not in range(128)

 in python 2.7
 how I explain to python to send the same thing to stdout and
 the file output.txt ?

 Don't seems logic, when send things to a file the beaviour
 change.

 Thanks,
 Sérgio M. B.
 
 That's not a terminal vs file thing. It's a file that declares its
 encoding vs a file that doesn't declare its encoding thing. Your
 terminal declares that it is UTF-8. So when you print a Unicode string
 to your terminal, Python knows that it's supposed to turn it into
 UTF-8. When you pipe the output to a file, that file doesn't declare
 an encoding. So rather than guess which encoding you want, Python
 defaults to the lowest common denominator: ASCII. If you want
 something to be a particular encoding, you have to encode it yourself.

Exactly the opposite: if Python doesn't know the encoding, it should not try
to decode to ASCII.

 
 You have a couple of choices on how to make it work:
 1) Play dumb and always encode as UTF-8. This would look really weird
 if someone tried running your program in a terminal with a CP-437
 encoding (like cmd.exe on at least the US version of Windows), but it
 would never crash.

I want Python not to care about the terminal's encoding and to send characters
as they are, whether to a terminal or to a file.

 2) Check sys.stdout.encoding. If it's ascii, then encode your unicode
 string in the unicode-escape encoding, which substitutes the escape
 sequence in for all non-ASCII characters.

How do I change sys.stdout.encoding to always be UTF-8? Or at least get a
consistent sys.stdout.encoding?

 3) Check to see if sys.stdout.isatty() and have different behavior for
 terminals vs files. If you're on a terminal that doesn't declare its
 encoding, encoding it as UTF-8 probably won't help. If you're writing
 to a file, that might be what you want to do.


Thanks,


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Ben Finney wrote:

 Sérgio Monteiro Basto sergi...@sapo.pt writes:
 
 ./test.py
 moçambique
 moçambique
 
 In this case your terminal is reporting its encoding to Python, and it's
 capable of taking the UTF-8 data that you send to it in both cases.
 
 ./test.py > output.txt
 Traceback (most recent call last):
   File "./test.py", line 5, in <module>
     print u
 UnicodeEncodeError: 'ascii' codec can't encode character
 u'\xe7' in position 2: ordinal not in range(128)
 
 In this case your shell has no preference for the encoding (since you're
 redirecting output to a file).
 

How do I tell Python that I want it to write utf-8 to files?


 In the first print statement you specify the encoding UTF-8, which is
 capable of encoding the characters.
 
 In the second print statement you haven't specified any encoding, so the
 default ASCII encoding is used.
 
 
 Moral of the tale: Make sure an encoding is specified whenever data
 steps between bytes and characters.
 
 Don't seems logic, when send things to a file the beaviour change.
 
 They're different files, which have been opened with different
 encodings. If you want a different encoding, you need to specify that.
 

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Nobody
On Thu, 09 Jun 2011 22:14:17 +0100, Sérgio Monteiro Basto wrote:

 Exactly the opposite , if python don't know the encoding should not try 
 decode to ASCII.

What should it decode to, then?

You can't write characters to a stream, only bytes.

 I want python don't care about encoding terminal and send characters as they 
 are or for a file . 

You can't write characters to a stream, only bytes.
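
Python 3 makes this distinction explicit, as the following sketch shows. `io.BytesIO` is used here as a stand-in for any byte stream, such as a file opened in binary mode.

```python
import io

stream = io.BytesIO()   # a byte stream

try:
    stream.write("mo\u00e7ambique")     # text: rejected
    accepted_text = True
except TypeError:
    accepted_text = False

stream.write("mo\u00e7ambique".encode("utf-8"))   # bytes: accepted
contents = stream.getvalue()
```

Somebody has to pick an encoding before the characters can reach the stream; the only question is whether the program or the runtime does it.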

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Ben Finney
Sérgio Monteiro Basto sergi...@sapo.pt writes:

 Ben Finney wrote:

  In this case your shell has no preference for the encoding (since
  you're redirecting output to a file).

 How I say to python that I want that write in utf-8 to files ? 

You already did:

  In the first print statement you specify the encoding UTF-8, which
  is capable of encoding the characters.

If you want UTF-8 on the byte stream for a file, specify it when opening
the file, or when reading or writing the file.
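
A minimal sketch of specifying the encoding when opening the file, using `io.open`, which accepts an `encoding` argument in both Python 2.6+ and Python 3. The temporary path is an assumption made for the sake of a self-contained example.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "output.txt")

# The encoding is fixed when the file is opened, so it no longer depends
# on the terminal, the locale, or whether output was redirected.
with io.open(path, "w", encoding="utf-8", newline="\n") as out_file:
    out_file.write(u"mo\u00e7ambique\n")

with io.open(path, "rb") as in_file:
    data = in_file.read()
```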

-- 
 \   “But Marge, what if we chose the wrong religion? Each week we |
  `\  just make God madder and madder.” —Homer, _The Simpsons_ |
_o__)  |
Ben Finney
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Terry Reedy

On 6/9/2011 5:46 PM, Nobody wrote:

On Thu, 09 Jun 2011 22:14:17 +0100, Sérgio Monteiro Basto wrote:


Exactly the opposite , if python don't know the encoding should not try
decode to ASCII.


What should it decode to, then?

You can't write characters to a stream, only bytes.


I want python don't care about encoding terminal and send characters as they
are or for a file .


You can't write characters to a stream, only bytes.


Character representations are for people, byte representations are for 
computers.


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Mark Tolonen


Sérgio Monteiro Basto sergi...@sapo.pt wrote in message 
news:4df137a7$0$30580$a729d...@news.telepac.pt...



How I change sys.stdout.encoding always to UTF-8 ? at least have a
consistent sys.stdout.encoding


There is an environment variable that can force Python I/O to be a specific 
encoding:


   PYTHONIOENCODING=utf-8

-Mark
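
The effect of the variable can be observed with a small experiment. This is a sketch in Python 3 syntax: it starts a child interpreter whose stdout is a pipe rather than a terminal, which is comparable to shell redirection, and asks it which encoding it chose.

```python
import os
import subprocess
import sys

child_code = "import sys; print(sys.stdout.encoding)"

# Copy the current environment and force the I/O encoding for the child.
env = dict(os.environ)
env["PYTHONIOENCODING"] = "utf-8"

# stdout is a pipe here, not a terminal.
reported = subprocess.check_output(
    [sys.executable, "-c", child_code], env=env
).decode("ascii").strip()

print(reported)
```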


--
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Nobody wrote:

 Exactly the opposite , if python don't know the encoding should not try
 decode to ASCII.
 
 What should it decode to, then?

UTF-8, as in the tty. How do I change this default?

 You can't write characters to a stream, only bytes.
 
OK, got the point.
Thanks, 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Mark Tolonen wrote:

 
 Sérgio Monteiro Basto sergi...@sapo.pt wrote in message
 news:4df137a7$0$30580$a729d...@news.telepac.pt...
 
 How I change sys.stdout.encoding always to UTF-8 ? at least have a
 consistent sys.stdout.encoding
 
 There is an environment variable that can force Python I/O to be a specific
 encoding:
 
 PYTHONIOENCODING=utf-8

Excellent thanks , double thanks.

BTW: this should be set by default on utf-8 systems like Fedora, Ubuntu,
Debian, Red Hat, and all Linuxes. For sure I will put this in the startup of
my systems.
 
 -Mark
--
Sérgio M. B.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Ben Finney
Sérgio Monteiro Basto sergi...@sapo.pt writes:

 Nobody wrote:

  Exactly the opposite , if python don't know the encoding should not
  try decode to ASCII.

Are you advocating that Python should refuse to write characters unless
the encoding is specified? I could sympathise with that, but currently
that's not what Python does; instead it defaults to the ASCII codec.

  What should it decode to, then?

 UTF-8, as in tty

But when you explicitly redirect to a file, it's *not* going to a TTY.
It's going to a file whose encoding isn't known unless you specify it.

-- 
 \ “Reality must take precedence over public relations, for nature |
  `\cannot be fooled.” —Richard P. Feynman |
_o__)  |
Ben Finney
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-09 Thread Sérgio Monteiro Basto
Ben Finney wrote:

  Exactly the opposite , if python don't know the encoding should not
  try decode to ASCII.
 
 Are you advocating that Python should refuse to write characters unless
 the encoding is specified? I could sympathise with that, but currently
 that's not what Python does; instead it defaults to the ASCII codec.

That could be a solution ;) or a smarter default based on LANG, for example
(as many GNU tools do).
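
A program can apply such a locale-based default itself. The sketch below relies on `locale.getpreferredencoding`, which on Unix-like systems typically reflects LANG and the LC_* variables; the helper name `default_output_encoding` and the UTF-8 fallback are assumptions for illustration.

```python
import codecs
import locale

def default_output_encoding(fallback="utf-8"):
    """Pick an encoding from the locale settings, with a fallback."""
    name = locale.getpreferredencoding(False) or fallback
    try:
        codecs.lookup(name)      # make sure Python knows this codec
    except LookupError:
        name = fallback
    return name

chosen = default_output_encoding()
print(chosen)
```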

--
Sérgio M. B.
-- 
http://mail.python.org/mailman/listinfo/python-list


the stupid encoding problem to stdout

2011-06-08 Thread Sérgio Monteiro Basto
hi,
cat test.py
#!/usr/bin/env python
#-*- coding: utf-8 -*-
u = u'moçambique'
print u.encode("utf-8")
print u

chmod +x test.py
./test.py
moçambique
moçambique

./test.py > output.txt
Traceback (most recent call last):
  File "./test.py", line 5, in <module>
    print u
UnicodeEncodeError: 'ascii' codec can't encode character
u'\xe7' in position 2: ordinal not in range(128)

In Python 2.7,
how do I tell Python to send the same thing to stdout and
to the file output.txt?

It doesn't seem logical that the behaviour changes
when sending things to a file.

Thanks,
Sérgio M. B. 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-08 Thread Ben Finney
Sérgio Monteiro Basto sergi...@sapo.pt writes:

 ./test.py
 moçambique
 moçambique

In this case your terminal is reporting its encoding to Python, and it's
capable of taking the UTF-8 data that you send to it in both cases.

 ./test.py > output.txt
 Traceback (most recent call last):
   File "./test.py", line 5, in <module>
     print u
 UnicodeEncodeError: 'ascii' codec can't encode character
 u'\xe7' in position 2: ordinal not in range(128)

In this case your shell has no preference for the encoding (since you're
redirecting output to a file).

In the first print statement you specify the encoding UTF-8, which is
capable of encoding the characters.

In the second print statement you haven't specified any encoding, so the
default ASCII encoding is used.


Moral of the tale: Make sure an encoding is specified whenever data
steps between bytes and characters.

 Don't seems logic, when send things to a file the beaviour change.

They're different files, which have been opened with different
encodings. If you want a different encoding, you need to specify that.

-- 
 \   “There's no excuse to be bored. Sad, yes. Angry, yes. |
  `\Depressed, yes. Crazy, yes. But there's no excuse for boredom, |
_o__)  ever.” —Viggo Mortensen |
Ben Finney
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: the stupid encoding problem to stdout

2011-06-08 Thread Benjamin Kaplan
2011/6/8 Sérgio Monteiro Basto sergi...@sapo.pt:
 hi,
 cat test.py
 #!/usr/bin/env python
 #-*- coding: utf-8 -*-
 u = u'moçambique'
 print u.encode("utf-8")
 print u

 chmod +x test.py
 ./test.py
 moçambique
 moçambique

 ./test.py > output.txt
 Traceback (most recent call last):
   File "./test.py", line 5, in <module>
     print u
 UnicodeEncodeError: 'ascii' codec can't encode character
 u'\xe7' in position 2: ordinal not in range(128)

 in python 2.7
 how I explain to python to send the same thing to stdout and
 the file output.txt ?

 Don't seems logic, when send things to a file the beaviour
 change.

 Thanks,
 Sérgio M. B.

That's not a terminal vs file thing. It's a file that declares its
encoding vs a file that doesn't declare its encoding thing. Your
terminal declares that it is UTF-8. So when you print a Unicode string
to your terminal, Python knows that it's supposed to turn it into
UTF-8. When you pipe the output to a file, that file doesn't declare
an encoding. So rather than guess which encoding you want, Python
defaults to the lowest common denominator: ASCII. If you want
something to be a particular encoding, you have to encode it yourself.

You have a couple of choices on how to make it work:
1) Play dumb and always encode as UTF-8. This would look really weird
if someone tried running your program in a terminal with a CP-437
encoding (like cmd.exe on at least the US version of Windows), but it
would never crash.
2) Check sys.stdout.encoding. If it's ascii, then encode your unicode
string in the unicode-escape encoding, which substitutes the escape
sequence in for all non-ASCII characters.
3) Check to see if sys.stdout.isatty() and have different behavior for
terminals vs files. If you're on a terminal that doesn't declare its
encoding, encoding it as UTF-8 probably won't help. If you're writing
to a file, that might be what you want to do.
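
Option 2 might look roughly like this in Python 3 syntax. It is a sketch, not a recommendation: the helper names `can_encode` and `prepare_for` are hypothetical, the `backslashreplace` error handler stands in for the escape encoding mentioned above, and in-memory streams replace the real `sys.stdout`.

```python
import io

def can_encode(text, encoding):
    """Return True if text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def prepare_for(stream, text):
    """Return text unchanged if the stream can encode it, else escape it."""
    encoding = getattr(stream, "encoding", None) or "ascii"
    if can_encode(text, encoding):
        return text
    # Substitute escape sequences for all non-ASCII characters.
    return text.encode("ascii", "backslashreplace").decode("ascii")

ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
utf8_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

escaped = prepare_for(ascii_stream, "mo\u00e7ambique")
unchanged = prepare_for(utf8_stream, "mo\u00e7ambique")
print(escaped)
```

For option 3, `stream.isatty()` could be consulted in the same helper before deciding what to do.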
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encoding problem when launching Python27 via DOS

2011-04-11 Thread Jean-Pierre M
Thanks a lot for this quick answer! It is very clear!

To better understand what the difference between encoding and decoding is, I
found the following website: http://www.evanjones.ca/python-utf8.html

I changed the program to:
# -*- coding: utf8 -*-
# (no more cp1252: the source file is now directly in UTF-8)
#!/usr/bin/python
'''
Created on 27 déc. 2010

@author: jpmena
'''
from datetime import datetime
import locale
import codecs
import os,sys


class Log(object):
    log=None
    def __init__(self,log_path):
        self.log_path=log_path
        if(os.path.exists(self.log_path)):
            os.remove(self.log_path)
        #self.log=open(self.log_path,'a')
        self.log=codecs.open(self.log_path, "a", 'utf-8')

    def getInstance(log_path=None):
        print "encodage systeme:"+sys.getdefaultencoding()
        if Log.log is None:
            if log_path is None:
                log_path=os.path.join(os.getcwd(),'logParDefaut.log')
            Log.log=Log(log_path)
        return Log.log

    getInstance=staticmethod(getInstance)

    def p(self,msg):
        aujour_dhui=datetime.now()
        date_stamp=aujour_dhui.strftime("%d/%m/%y-%H:%M:%S")
        print sys.getdefaultencoding()
        unicode_str='%s : %s \n'  % (date_stamp,unicode(msg,'utf-8'))
        #unicode_str=msg
        self.log.write(unicode_str)
        return unicode_str

    def close(self):
        self.log.flush()
        self.log.close()
        return self.log_path


if __name__ == '__main__':
    l=Log.getInstance()
    l.p("premier message de Log à accents")
    Log.getInstance().p("second message de Log")
    l.close()

The DOS console output is now:
C:\Documents and Settings\jpmena\Mes documents\VelocityRIF\VelocityTransforms>generationProgrammeSitePublicActuel.cmd
Page de codes active : 1252
encodage systeme:ascii
ascii
encodage systeme:ascii
ascii

And the generated log file now shows the expected result:
11/04/11-10:53:44 : premier message de Log à accents
11/04/11-10:53:44 : second message de Log

Thanks.

If you have other interesting links about unicode encoding and decoding in
Python, they are welcome.

2011/4/10 MRAB pyt...@mrabarnett.plus.com

 On 10/04/2011 13:22, Jean-Pierre M wrote:
  I created a simple program which writes in a unicode files some french
 text with accents!
 [snip]
 This line:


l.p("premier message de Log à accents")

 passes a bytestring to the method, and inside the method, this line:


unicode_str=u'%s : %s \n'  %
 (date_stamp,msg.encode(self.charset_log,'replace'))

 it tries to encode the bytestring to Unicode.

 It's not possible to encode a bytestring, only a Unicode string, so
 Python tries to decode the bytestring using the fallback encoding
 (ASCII) and then encode the result.

 Unfortunately, the bytestring isn't ASCII (it contains accented
 characters), so it can't be decoded as ASCII, hence the exception.

 BTW, it's probably better to forget about cp1252, etc, and use UTF-8
 instead, and also to use Unicode wherever possible.
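
The implicit step described above can be made visible with an explicit Python 3 sketch. Python 3 no longer performs the hidden ASCII decode, so each step is written out by hand; the cp1252 byte string is an assumed stand-in for the literal in the original source file.

```python
# A byte string as a cp1252-encoded source file would contain it.
raw = "premier message de Log \u00e0 accents".encode("cp1252")

# What Python 2 attempted implicitly: decode with the ASCII fallback.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding explicitly with the real encoding works, and the resulting
# text can then be encoded to any target encoding, e.g. UTF-8.
text = raw.decode("cp1252")
utf8_bytes = text.encode("utf-8")
```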
 --
 http://mail.python.org/mailman/listinfo/python-list

-- 
http://mail.python.org/mailman/listinfo/python-list


Encoding problem when launching Python27 via DOS

2011-04-10 Thread Jean-Pierre M
I created a simple program which writes some French text with accents to a
Unicode file:
# -*- coding: cp1252 -*-
#!/usr/bin/python
'''
Created on 27 déc. 2010

@author: jpmena
'''
from datetime import datetime
import locale
import codecs
import os,sys

class Log(object):
    log=None
    def __init__(self,log_path,charset_log=None):
        self.log_path=log_path
        if(os.path.exists(self.log_path)):
            os.remove(self.log_path)
        #self.log=open(self.log_path,'a')
        if charset_log is None:
            self.charset_log=sys.getdefaultencoding()
        else:
            self.charset_log=charset_log
        self.log=codecs.open(self.log_path, "a", charset_log)

    def getInstance(log_path=None):
        print "encodage systeme:"+sys.getdefaultencoding()
        if Log.log is None:
            if log_path is None:
                log_path=os.path.join(os.getcwd(),'logParDefaut.log')
            Log.log=Log(log_path)
        return Log.log

    getInstance=staticmethod(getInstance)

    def p(self,msg):
        aujour_dhui=datetime.now()
        date_stamp=aujour_dhui.strftime("%d/%m/%y-%H:%M:%S")
        print sys.getdefaultencoding()
        unicode_str=u'%s : %s \n'  % (date_stamp,msg.encode(self.charset_log,'replace'))
        self.log.write(unicode_str)
        return unicode_str

    def close(self):
        self.log.flush()
        self.log.close()
        return self.log_path

if __name__ == '__main__':
    l=Log.getInstance()
    l.p("premier message de Log à accents")
    Log.getInstance().p("second message de Log")
    l.close()

I am using PyDev/Aptana for development. If Aptana launches the program,
everything goes well: sys.getdefaultencoding() returns 'cp1252'.

But if I execute the following batch file in a DOS console on my Windows
VISTA:

@echo off
setlocal
chcp 1252
set PYTHON_HOME=C:\Python27
for /F "tokens=1-4 delims=/ " %%i in ('date /t') do (
    if "%%l"=="" (
        :: Windows XP
        set D=%%k%%j%%i
    ) else (
        :: Windows NT/2000
        set D=%%l%%k%%j
    )
)
set PYTHONIOENCODING=cp1252:backslashreplace
%PYTHON_HOME%\python.exe "%~dp0\src\utils\Log.py"

the answer is:
C:\Users\jpmena\Documents\My Dropbox\RIF\Python\VelocityTransforms>generationProgrammeSitePublicActuel.cmd
Page de codes active : 1252
encodage systeme:ascii
ascii
Traceback (most recent call last):
  File "C:\Users\jpmena\Documents\My Dropbox\RIF\Python\VelocityTransforms\\src\utils\Log.py", line 51, in <module>
    l.p("premier message de Log à accents")
  File "C:\Users\jpmena\Documents\My Dropbox\RIF\Python\VelocityTransforms\\src\utils\Log.py", line 40, in p
    unicode_str=u'%s : %s \n'  % (date_stamp,msg.encode(self.charset_log,'replace'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 23: ordinal not in range(128)

sys.getdefaultencoding() returns 'ascii', so the encode call cannot handle
the accented 'à'.
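
To compare the two environments (IDE versus console), a small diagnostic along these lines may help. It is only a sketch: `encoding_report` is a hypothetical helper name, not part of the program above. As far as I know, PYTHONIOENCODING changes only the standard streams and leaves sys.getdefaultencoding() untouched, which would be consistent with the 'ascii' printed above.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python is using in the current process."""
    return {
        # Used for implicit str/unicode conversions in Python 2 ('ascii' by default).
        "default": sys.getdefaultencoding(),
        # Encoding of the console stream; PYTHONIOENCODING overrides this one.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding suggested by the OS locale or active code page.
        "preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("%-9s : %s" % (name, value))
```

Running it once from the IDE and once from the batch file should show which of the three values differs.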


I am using Python27 because it is compatible with the current versions of
pyodbc (for accessing an Access database) and airspeed (Velocity templates in
utf-8).

The goal is to launch airspeed applications via the Windows scheduler
(the cron equivalent).

Can someone help me? I am really stuck!

Thanks...


Re: Encoding problem when launching Python27 via DOS

2011-04-10 Thread MRAB

On 10/04/2011 13:22, Jean-Pierre M wrote:
> I created a simple program which writes in a unicode files some
> french text with accents!

[snip]
This line:

    l.p("premier message de Log à accents")

passes a bytestring to the method, and inside the method, this line:

    unicode_str=u'%s : %s \n'  %
    (date_stamp,msg.encode(self.charset_log,'replace'))


it tries to encode the bytestring to Unicode.

It's not possible to encode a bytestring, only a Unicode string, so
Python tries to decode the bytestring using the fallback encoding
(ASCII) and then encode the result.

Unfortunately, the bytestring isn't ASCII (it contains accented
characters), so it can't be decoded as ASCII, hence the exception.
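
The failure can be reproduced in isolation. The following is a minimal sketch in Python 3 syntax (an assumption; the thread uses Python 2.7, where the decode step happens implicitly). It builds the same bytes a cp1252 source file would contain and shows that the ASCII decode fails at the accented character.

```python
# The literal as it would be stored in a cp1252-encoded source file.
raw = "premier message de Log à accents".encode("cp1252")

failure = None
try:
    # Python 2 performed this step implicitly when .encode() was
    # called on a bytestring.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    failure = exc

# Decoding with the encoding the bytes were really written in works.
text = raw.decode("cp1252")
```

The `failure.start` attribute points at the offending byte, which matches the position reported in the traceback.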

BTW, it's probably better to forget about cp1252, etc, and use UTF-8
instead, and also to use Unicode wherever possible.
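
As an illustration of that advice, here is a rough sketch of a logger that keeps text as Unicode and lets the file object do the UTF-8 encoding. The function name `log_line` and the temporary log path are hypothetical, chosen only for this example; it is not the original Log class.

```python
import io
import os
import tempfile
from datetime import datetime


def log_line(handle, msg):
    """Write one timestamped line; msg must already be a text (Unicode) string."""
    stamp = datetime.now().strftime("%d/%m/%y-%H:%M:%S")
    line = u"%s : %s\n" % (stamp, msg)
    handle.write(line)
    return line


# Hypothetical log location, used only for this illustration.
log_path = os.path.join(tempfile.mkdtemp(), "logParDefaut.log")

# io.open encodes on write, so no explicit .encode() call is needed.
with io.open(log_path, "a", encoding="utf-8") as log:
    written = log_line(log, u"premier message de Log à accents")

with io.open(log_path, "r", encoding="utf-8") as log:
    read_back = log.read()
```

The point of the design is that encoding happens in exactly one place, at the file boundary, so the rest of the program never mixes bytes and text.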


Re: nntplib encoding problem

2011-02-28 Thread Laurent Duchesne

Hi,

Thanks it's working!
But is it normal for a string coming out of a module (nntplib) to 
crash when passed to print or write?


I'm just asking to know if I should open a bug report or not :)
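
For reference, a sketch of why print and write fail and one way around it when writing to a file. It assumes the goal is to get the original bytes back out; the shortened subject and the temporary output path are made up for the example.

```python
import io
import os
import tempfile

# The subject as nntplib returned it in the report (shortened).
subject = "Myl\udce8ne Farmer"

try:
    subject.encode("utf-8")  # what print() and write() do by default
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Hypothetical output file for the illustration.
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# errors="surrogateescape" turns U+DCE8 back into the original byte 0xE8.
with io.open(path, "a", encoding="utf-8", errors="surrogateescape") as h:
    h.write(subject)

with io.open(path, "rb") as h:
    raw = h.read()
```

Note that the file then contains the raw Latin-1 byte rather than valid UTF-8, so this only makes sense if the bytes are meant to be passed through unchanged.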

I'm also wondering which strings should be re-encoded using the 
surrogateescape parameter and which should not.. I guess I could 
reencode them all and it wouldn't cause any problems?
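
On the question of re-encoding everything: a small experiment suggests that it is harmless for pure-ASCII strings but would mangle strings that were already decoded correctly. The sketch below uses two hypothetical helpers, `has_escaped_bytes` and `repair`, and assumes Latin-1 as the real encoding, which is only a guess for any given article.

```python
def has_escaped_bytes(text):
    """True if text contains surrogates produced by errors='surrogateescape'."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


def repair(text, real_encoding="latin-1"):
    """Re-decode only strings that carry escaped bytes; leave the rest alone."""
    if not has_escaped_bytes(text):
        return text
    return text.encode("utf-8", "surrogateescape").decode(real_encoding)


damaged = "Myl\udce8ne Farmer"  # Latin-1 bytes that were decoded as UTF-8
clean = "Mylène Farmer"         # already correct

fixed = repair(damaged)
untouched = repair(clean)

# Applying the conversion blindly would mangle the already-correct string.
blind = clean.encode("utf-8", "surrogateescape").decode("latin-1")
```

A string that mixes valid UTF-8 sequences with escaped bytes would still come out wrong, so this check is a heuristic rather than a complete answer.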


Laurent

On Mon, 28 Feb 2011 02:12:20 +, MRAB wrote:

On 28/02/2011 01:31, Laurent Duchesne wrote:

Hi,

I'm using python 3.2 and got the following error:


nntpClient = nntplib.NNTP_SSL(...)
nntpClient.group(alt.binaries.cd.lossless)
nntpClient.over((534157,534157))
... 'subject': 'Myl\udce8ne Farmer - Anamorphosee (Japan Edition) 
1995

[02/41] Back.jpg yEnc (1/3)' ...

overview = nntpClient.over((534157,534157))
print(overview[1][0][1]['subject'])

Traceback (most recent call last):
File stdin, line 1, in module
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
position 3: surrogates not allowed

I'm not sure if I should report this as a bug in nntplib or if I'm 
doing

something wrong.

Note that I get the same error if I try to write this data to a 
file:



h = open(output.txt, a)
h.write(overview[1][0][1]['subject'])

Traceback (most recent call last):
File stdin, line 1, in module
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
position 3: surrogates not allowed


It looks like the subject was originally encoded as Latin-1 (or
similar) (b'Myl\xe8ne Farmer - Anamorphosee (Japan Edition) 1995
[02/41] Back.jpg yEnc (1/3)') but has been decoded as UTF-8 with
surrogateescape passed as the errors parameter.

You can get the correct Unicode by encoding as UTF-8 with
surrogateescape and then decoding as Latin-1:

overview[1][0][1]['subject'].encode(utf-8,
surrogateescape).decode(latin-1)




nntplib encoding problem

2011-02-27 Thread Laurent Duchesne

Hi,

I'm using python 3.2 and got the following error:


nntpClient = nntplib.NNTP_SSL(...)
nntpClient.group("alt.binaries.cd.lossless")
nntpClient.over((534157,534157))
... 'subject': 'Myl\udce8ne Farmer - Anamorphosee (Japan Edition) 1995 
[02/41] Back.jpg yEnc (1/3)' ...

overview = nntpClient.over((534157,534157))
print(overview[1][0][1]['subject'])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in 
position 3: surrogates not allowed


I'm not sure if I should report this as a bug in nntplib or if I'm 
doing something wrong.


Note that I get the same error if I try to write this data to a file:


h = open("output.txt", "a")
h.write(overview[1][0][1]['subject'])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in 
position 3: surrogates not allowed


Thanks,
Laurent


Re: nntplib encoding problem

2011-02-27 Thread MRAB

On 28/02/2011 01:31, Laurent Duchesne wrote:

Hi,

I'm using python 3.2 and got the following error:


nntpClient = nntplib.NNTP_SSL(...)
nntpClient.group(alt.binaries.cd.lossless)
nntpClient.over((534157,534157))

... 'subject': 'Myl\udce8ne Farmer - Anamorphosee (Japan Edition) 1995
[02/41] Back.jpg yEnc (1/3)' ...

overview = nntpClient.over((534157,534157))
print(overview[1][0][1]['subject'])

Traceback (most recent call last):
File stdin, line 1, in module
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
position 3: surrogates not allowed

I'm not sure if I should report this as a bug in nntplib or if I'm doing
something wrong.

Note that I get the same error if I try to write this data to a file:


h = open(output.txt, a)
h.write(overview[1][0][1]['subject'])

Traceback (most recent call last):
File stdin, line 1, in module
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
position 3: surrogates not allowed


It looks like the subject was originally encoded as Latin-1 (or
similar) (b'Myl\xe8ne Farmer - Anamorphosee (Japan Edition) 1995
[02/41] Back.jpg yEnc (1/3)') but has been decoded as UTF-8 with
surrogateescape passed as the errors parameter.

You can get the correct Unicode by encoding as UTF-8 with
surrogateescape and then decoding as Latin-1:

overview[1][0][1]['subject'].encode("utf-8",
"surrogateescape").decode("latin-1")
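
The round trip can be checked without a news server. This sketch starts from the bytes quoted above (truncated) and assumes, as suggested, that the library decoded them as UTF-8 with surrogateescape.

```python
raw = b"Myl\xe8ne Farmer - Anamorphosee (Japan Edition) 1995"

# What nntplib appears to have done: UTF-8 decoding with surrogateescape.
as_received = raw.decode("utf-8", "surrogateescape")

# The fix: go back to the original bytes, then decode them as Latin-1.
restored = as_received.encode("utf-8", "surrogateescape").decode("latin-1")
```

Because surrogateescape is lossless, the intermediate encode step reproduces `raw` exactly, whatever the real encoding turns out to be.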



Re: nntplib encoding problem

2011-02-27 Thread Thomas L. Shinnick

At 08:12 PM 2/27/2011, you wrote:

On 28/02/2011 01:31, Laurent Duchesne wrote:

Hi,

I'm using python 3.2 and got the following error:


nntpClient = nntplib.NNTP_SSL(...)
nntpClient.group(alt.binaries.cd.lossless)
nntpClient.over((534157,534157))

... 'subject': 'Myl\udce8ne Farmer - Anamorphosee (Japan Edition) 1995
[02/41] Back.jpg yEnc (1/3)' ...

overview = nntpClient.over((534157,534157))
print(overview[1][0][1]['subject'])

Traceback (most recent call last):
File stdin, line 1, in module
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
position 3: surrogates not allowed

I'm not sure if I should report this as a bug in nntplib or if I'm doing
something wrong.

Note that I get the same error if I try to write this data to a file:


h = open(output.txt, a)
h.write(overview[1][0][1]['subject'])

Traceback (most recent call last):
File stdin, line 1, in module
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
position 3: surrogates not allowed

It looks like the subject was originally encoded as Latin-1 (or
similar) (b'Myl\xe8ne Farmer - Anamorphosee (Japan Edition) 1995
[02/41] Back.jpg yEnc (1/3)') but has been decoded as UTF-8 with
surrogateescape passed as the errors parameter.


3.2 Docs
  6.6. codecs — Codec registry and base classes
Possible values for errors are
  'surrogateescape': replace with surrogate U+DCxx, see PEP 383

Yes, it would have been 0xE8 -  Mylène
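
The mapping from PEP 383 is simple arithmetic: an undecodable byte 0xNN becomes the code point U+DC00 + 0xNN. A quick sketch (Python 3) to confirm the byte behind '\udce8':

```python
escaped = "\udce8"

# Subtracting the surrogate base recovers the original byte value.
original_byte = ord(escaped) - 0xDC00

# Interpreted as Latin-1, that byte is the accented letter in the name.
char = bytes([original_byte]).decode("latin-1")
```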

Googling on surrogateescape I can see lots of 
argument about unintended outcomes  yikes!



You can get the correct Unicode by encoding as UTF-8 with
surrogateescape and then decoding as Latin-1:


overview[1][0][1]['subject'].encode(utf-8, 
surrogateescape).decode(latin-1)


Encoding problem - or bug in couchdb-0.8-py2.7.egg??

2010-09-20 Thread Ian Hobson

Hi all,

I have hit a problem and I don't know enough about python to diagnose 
things further. Trying to use couchDB from Python. This script:-


# coding=utf8
import couchdb
from couchdb.client import Server
server = Server()
dbName = 'python-tests'
try:
    db = server.create(dbName)
except couchdb.PreconditionFailed:
    del server[dbName]
    db = server.create(dbName)
doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})

Gives this traceback:-

D:\work\C-U-B>python tes1.py
Traceback (most recent call last):
  File "tes1.py", line 11, in <module>
    doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\client.py", line 407, in save
    _, _, data = func(body=doc, **options)
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 399, in post_json
    status, headers, data = self.post(*a, **k)
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 381, in post
    **params)
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 419, in _request
    credentials=self.credentials)
  File "C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py", line 310, in request
    raise ServerError((status, error))
couchdb.http.ServerError: (400, ('bad_request', 'invalid UTF-8 JSON'))

D:\work\C-U-B>

Why? I've tried adding u to the strings, and removing the # coding line, 
and I still get the same error.


Thanks for any help.

Ian




Re: Encoding problem - or bug in couchdb-0.8-py2.7.egg??

2010-09-20 Thread Diez B. Roggisch
Ian Hobson i...@ianhobson.co.uk writes:

 Hi all,

 I have hit a problem and I don't know enough about python to diagnose
 things further. Trying to use couchDB from Python. This script:-

 # coding=utf8
 import couchdb
 from couchdb.client import Server
 server = Server()
 dbName = 'python-tests'
 try:
 db = server.create(dbName)
 except couchdb.PreconditionFailed:
 del server[dbName]
 db = server.create(dbName)
 doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})

 Gives this traceback:-

 D:\work\C-U-Bpython tes1.py
 Traceback (most recent call last):
   File tes1.py, line 11, in module
 doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})
   File
 C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\client.py,
 line 407, in save
 _, _, data = func(body=doc, **options)
   File
 C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py,
 line 399, in post_json
 status, headers, data = self.post(*a, **k)
   File
 C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py,
 line 381, in post
 **params)
   File
 C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py,
 line 419, in _request
 credentials=self.credentials)
   File
 C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py,
 line 310, in request
 raise ServerError((status, error))
 couchdb.http.ServerError: (400, ('bad_request', 'invalid UTF-8 JSON'))

 D:\work\C-U-B

 Why? I've tried adding u to the strings, and removing the # coding
 line, and I still get the same error.

Sounds cargo-cultish. I suggest you read the python introduction on
unicode.

 http://docs.python.org/howto/unicode.html

For your actual problem, I have difficulty seeing how it can happen
with the above data, frankly because there is nothing outside the
ASCII range in the data, so there is no reason why anything could be
wrongly encoded.
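
That reasoning can be checked independently of couchdb. The sketch below serializes the same document with the standard-library json module, used here only as a stand-in; which JSON encoder couchdb 0.8 actually picks up is not shown in this thread.

```python
import json

doc = {'type': 'Person', 'name': 'John Doe'}

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
body = json.dumps(doc).encode("utf-8")

# Every byte of the request body is plain ASCII, hence valid UTF-8.
is_ascii = all(byte < 128 for byte in bytearray(body))
round_trip = json.loads(body.decode("utf-8"))
```

If a correct encoder produces a clean body like this, the 'invalid UTF-8 JSON' response points at the client library or its installation rather than at the data.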

But googling the error-message reveals that there seem to be totally
unrelated reasons for this:

  http://sindro.me/2010/4/3/couchdb-invalid-utf8-json

Maybe using something like tcpmon or ethereal to capture the actual
HTTP-request helps to see where the issue comes from.

Diez


Re: Encoding problem - or bug in couchdb-0.8-py2.7.egg??

2010-09-20 Thread Ian

 Thanks Diez,

Removing, rebooting and installing the latest version solved the 
problem.  :)


Your google-fu is better than mine.  Google had not turned that up for me.

Thanks again

Regards

Ian



On 20/09/2010 17:00, Diez B. Roggisch wrote:

Ian Hobson i...@ianhobson.co.uk writes:


Hi all,

I have hit a problem and I don't know enough about python to diagnose
things further. Trying to use couchDB from Python. This script:-

# coding=utf8
import couchdb
from couchdb.client import Server
server = Server()
dbName = 'python-tests'
try:
 db = server.create(dbName)
except couchdb.PreconditionFailed:
 del server[dbName]
 db = server.create(dbName)
doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})

Gives this traceback:-

D:\work\C-U-Bpython tes1.py
Traceback (most recent call last):
   File tes1.py, line 11, inmodule
 doc_id, doc_rev = db.save({'type': 'Person', 'name': 'John Doe'})
   File
C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\client.py,
line 407, in save
 _, _, data = func(body=doc, **options)
   File
C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py,
line 399, in post_json
 status, headers, data = self.post(*a, **k)
   File
C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py,
line 381, in post
 **params)
   File
C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py,
line 419, in _request
 credentials=self.credentials)
   File
C:\Python27\lib\site-packages\couchdb-0.8-py2.7.egg\couchdb\http.py,
line 310, in request
 raise ServerError((status, error))
couchdb.http.ServerError: (400, ('bad_request', 'invalid UTF-8 JSON'))

D:\work\C-U-B

Why? I've tried adding u to the strings, and removing the # coding
line, and I still get the same error.

Sounds cargo-cultish. I suggest you read the python introduction on
unicode.

  http://docs.python.org/howto/unicode.html

For your actual problem, I have difficulties seeing how it can happen
with the above data - frankly because there is nothing outside the
ascii-range of data, so there is no reason why anything could be wrong
encoded.

I came to the same conclusion.

But googling the error-message reveals that there seem to be totally
unrelated reasons for this:

   http://sindro.me/2010/4/3/couchdb-invalid-utf8-json

Maybe using something like tcpmon or ethereal to capture the actual
HTTP-request helps to see where the issue comes from.

Diez




encoding problem

2009-06-27 Thread netpork
Hello,

I have an SSL socket server and client; on my development machine
everything works pretty well.
The database I have to use is MSSQL on MS Server 2003, so I decided
to install the same Python configuration there and run my Python server
script.

Now here is the problem: the server is returning strange characters,
although the default encoding is the same on both the development and
server machines.


Any hints?

Thanks in advance.


Re: encoding problem

2009-06-27 Thread Piet van Oostrum
netpork todorovic.de...@gmail.com (n) wrote:

> Hello,
> I have ssl socket with server and client, on my development machine
> everything works pretty well.
> Database which I have to use is mssql on ms server 2003, so I decided
> to install the same python config there and run my python server
> script.
>
> Now here is the problem, server is returning strange characters
> although default encoding is the same on both development and server
> machines.
>
> Any hints?

Yes, read http://catb.org/esr/faqs/smart-questions.html
-- 
Piet van Oostrum p...@cs.uu.nl
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org


Re: encoding problem

2009-06-27 Thread dejan todorović
It was a problem with pymssql, which does not support Unicode. I switched
to pyodbc and everything is fine.

Thanks for your swift reply. ;)



On Jun 27, 7:44 pm, Piet van Oostrum p...@cs.uu.nl wrote:
  netpork todorovic.de...@gmail.com (n) wrote:
 n Hello,
 n I have ssl socket with server and client, on my development machine
 n everything works pretty well.
 n Database which I have to use is mssql on ms server 2003, so I decided
 n to install the same python config there and run my python server
 n script.
 n Now here is the problem, server is returning strange characters
 n although default encoding is the same on both development and server
 n machines.
 n Any hints?

 Yes, read http://catb.org/esr/faqs/smart-questions.html
 --
 Piet van Oostrum p...@cs.uu.nl
 URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
 Private email: p...@vanoostrum.org



Re: SyntaxError: encoding problem: with BOM

2008-12-26 Thread Gabriel Genellina

On Thu, 25 Dec 2008 11:55:16 -0200, NoName zaz...@gmail.com wrote:


> Error
>
> C:\Documents and Settings\Ra\Рабочий стол>11.py
>   File decoding error, line 1
> SyntaxError: encoding problem: with BOM
>
> No error
>
> C:\Documents and Settings\Ra\Рабочий стол>python 11.py
> test
>
> Error when russian symbols in full path to py-script.
> Is it Python bug? or i need to modify some registry keys?


Yes, it's a bug. The encoding declaration may be anything: ascii, or even a
nonexistent codec, will trigger the bug. Any non-ASCII character in the
script name or path then provokes a SyntaxError when the script is
executed directly.
As a workaround, avoid using any Russian characters in directory names or
script file names, or always invoke the scripts using "python xxx.py", not
directly.
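
To check whether a given script location would be affected, something like the following could be used. It is only a sketch of the condition described above (any non-ASCII character in the path); `non_ascii_chars` is a hypothetical helper and the second path is an invented ASCII-only example.

```python
def non_ascii_chars(path):
    """Return the characters in path that fall outside the ASCII range."""
    return [ch for ch in path if ord(ch) > 127]


# The path from the report, and a hypothetical ASCII-only location.
affected = non_ascii_chars(u"C:\\Documents and Settings\\Ra\\Рабочий стол\\11.py")
safe = non_ascii_chars(u"C:\\scripts\\11.py")
```

An empty result means the path should not trigger the bug when the script is launched directly.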


> OS: WinXP SP3 Russian.
> Python 3.0 (r30:67507, Dec  3 2008, 20:14:27) [MSC v.1500 32 bit
> (Intel)] on win32


My tests were on WinXP SP3 Spanish.
See http://bugs.python.org/issue4747

--
Gabriel Genellina


