subject:"Unicode error"

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-12 Thread zljubisicmob

I would say so as well.
Thanks to everyone who helped.

Regards and best wishes.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-10 Thread Dave Angel


On 05/10/2015 05:10 PM, zljubisic...@gmail.com wrote:

No, we can't see what ROOTDIR is, since you read it from the config
file.  And you don't show us the results of those prints.  You don't
even show us the full exception, or even the line it fails on.


Sorry I forgot. This is the output of the script:

C:\Python34\python.exe C:/Users/zoran/PycharmProjects/mm_align/bckslash_test.py
C:\Users\zoran\hrt
Traceback (most recent call last):
  File "C:/Users/zoran/PycharmProjects/mm_align/bckslash_test.py", line 43, in <module>
    with open(src_file, mode='w', encoding='utf-8') as s_file:
FileNotFoundError: [Errno 2] No such file or directory: 
'C:\\Users\\zoran\\hrt\\src_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt'
70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o
260 
C:\Users\zoran\hrt\src_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt
260 
C:\Users\zoran\hrt\des_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt

Process finished with exit code 1

Cfg file has the following contents:

C:\Users\zoran\PycharmProjects\mm_align\hrt3.cfg contents
[Dir]
ROOTDIR = C:\Users\zoran\hrt


I doubt that the problem is in the ROOTDIR value, but of course nothing
in your program bothers to check that that directory exists.  I expect
you either have too many characters total, or the 232nd character is a
strange one.  Or perhaps title has a backslash in it (you took care of
forward slash).


How to determine that?


Probably by calling os.path.isdir()
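A rough sketch of how those three suspicions (missing directory, too many
characters, a troublesome character) could be checked in one go. The helper
name diagnose and the limit of 259 are assumptions for illustration; 259 is
the figure mentioned elsewhere in this thread, not something the code can
discover by itself.

```python
import ntpath
import os

# Characters that Windows does not allow inside a file name.
RESERVED = set('<>:"/\\|?*')


def diagnose(path, limit=259):
    """Return a list of likely reasons why open(path, 'w') could fail on Windows."""
    problems = []
    # ntpath splits on backslashes no matter which OS runs this code.
    directory, name = ntpath.split(path)
    if not os.path.isdir(directory):
        problems.append('directory does not exist: %r' % directory)
    if len(path) > limit:
        problems.append('path has %d characters, limit is %d' % (len(path), limit))
    bad = sorted(set(name) & RESERVED)
    if bad:
        problems.append('reserved characters in file name: %s' % ' '.join(bad))
    return problems


# Hypothetical path, chosen so that every check fires.
problems = diagnose('C:\\no_such_dir_for_this_demo\\a?b' + 'x' * 300 + '.txt')
for line in problems:
    print(line)
```

An empty list means none of the three checks found anything, which would
point towards some other cause.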




While we're at it, if you do have an OS limitation on size, your code is
truncating at the wrong point.  You need to truncate the title based on
the total size of src_file and dst_file, and since the code cannot know
the size of ROOTDIR, you need to include that in your figuring.


Well, in my program I am defining a file name as category-id-description.mp3.
If the file is too long I am cutting description (it wasn't clear from my 
example).


Since you've got non-ASCII characters in that name, the utf-8 version of 
the name will be longer.  I don't run Windows, but perhaps it's just a 
length problem after all.
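To see how much the different ways of counting disagree, something like the
following can be used. The sample name is a fragment of the title above. My
understanding is that the Windows limit counts 16-bit units rather than UTF-8
bytes, but treat that as an assumption to verify.

```python
name = 'godišnjica_pobjede_nad_fašizmom'

chars = len(name)                                 # code points, what len() and slicing use
utf8_bytes = len(name.encode('utf-8'))            # bytes in the UTF-8 encoding
utf16_units = len(name.encode('utf-16-le')) // 2  # 16-bit units in UTF-16

print(chars, utf8_bytes, utf16_units)
```

For this name only the UTF-8 count grows, because each š takes two bytes
there but still fits in a single 16-bit unit.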




--
DaveA

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-10 Thread zljubisicmob

> No, we can't see what ROOTDIR is, since you read it from the config 
> file.  And you don't show us the results of those prints.  You don't 
> even show us the full exception, or even the line it fails on.

Sorry I forgot. This is the output of the script:

C:\Python34\python.exe C:/Users/zoran/PycharmProjects/mm_align/bckslash_test.py
C:\Users\zoran\hrt
Traceback (most recent call last):
  File "C:/Users/zoran/PycharmProjects/mm_align/bckslash_test.py", line 43, in <module>
    with open(src_file, mode='w', encoding='utf-8') as s_file:
FileNotFoundError: [Errno 2] No such file or directory: 
'C:\\Users\\zoran\\hrt\\src_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt'
70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o
260 
C:\Users\zoran\hrt\src_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt
260 
C:\Users\zoran\hrt\des_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt

Process finished with exit code 1

Cfg file has the following contents:

C:\Users\zoran\PycharmProjects\mm_align\hrt3.cfg contents
[Dir]
ROOTDIR = C:\Users\zoran\hrt 

> I doubt that the problem is in the ROOTDIR value, but of course nothing 
> in your program bothers to check that that directory exists.  I expect 
> you either have too many characters total, or the 232nd character is a 
> strange one.  Or perhaps title has a backslash in it (you took care of 
> forward slash).

How to determine that?
 
> While we're at it, if you do have an OS limitation on size, your code is 
> truncating at the wrong point.  You need to truncate the title based on 
> the total size of src_file and dst_file, and since the code cannot know 
> the size of ROOTDIR, you need to include that in your figuring.

Well, in my program I am defining a file name as category-id-description.mp3.
If the file is too long I am cutting description (it wasn't clear from my 
example).

Regards.

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-10 Thread zljubisicmob

> > It works, but if you change title = title[:232] to title = title[:233],
> > you will get "FileNotFoundError: [Errno 2] No such file or directory".
> 
> 
> Which is a *completely different* error from 
> 
> SyntaxError: 'unicodeescape' codec can't decode bytes in position 2-3:
> truncated \U escape

I don't know when the original error disappeared and became this one (confused).

Regards.

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-09 Thread Chris Angelico

On Sun, May 10, 2015 at 1:13 AM, Steven D'Aprano wrote:
> FileNotFoundError means that the program did run, it tried to open a file,
> but the file doesn't exist.

Normally it does, at least. Sometimes it means that a *directory*
doesn't exist (for instance, you can get this when you try to create a
new file, which otherwise wouldn't make sense), and occasionally,
Windows will give you rather peculiar errors when weird things go
wrong, which may be what's going on here (maximum path length - though
that can be overridden by switching to a UNC-style path).

Steven's point still stands - very different from SyntaxError - but
unfortunately it's not always as simple as the name suggests. Thank
you oh so much, Windows.
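A sketch of the path rewriting this would involve, assuming that "UNC-style
path" refers to the \\?\ extended-length prefix. The helper name
extended_length is made up for this example, and ntpath is used so that the
string handling can be tried on any OS.

```python
import ntpath

PREFIX = '\\\\?\\'  # the four characters \\?\


def extended_length(path):
    """Return an absolute Windows path with the extended-length prefix added."""
    if path.startswith(PREFIX):
        return path
    if not ntpath.isabs(path):
        raise ValueError('an absolute path is required: %r' % path)
    path = ntpath.normpath(path)  # also turns forward slashes into backslashes
    if path.startswith('\\\\'):
        # \\server\share\... becomes \\?\UNC\server\share\...
        return PREFIX + 'UNC\\' + path[2:]
    return PREFIX + path


print(extended_length('C:/Users/zoran/hrt/src_title.txt'))
```

The result would then be passed to open() in place of the plain path.
Whether that gets around the limit on a given Windows and Python version is
something to test rather than assume.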

ChrisA

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-09 Thread Steven D'Aprano

On Sat, 9 May 2015 08:31 pm, zljubisic...@gmail.com wrote:

> It works, but if you change title = title[:232] to title = title[:233],
> you will get "FileNotFoundError: [Errno 2] No such file or directory".

Which is a *completely different* error from 

SyntaxError: 'unicodeescape' codec can't decode bytes in position 2-3:
truncated \U escape

> As you can see ROOTDIR contains \U.

How can I possibly see that? Your code reads ROOTDIR from the config file,
which you don't show us.

I agree with you that Windows has limitations on the length of file names,
and that you get an error if you give a file name that cannot be found. The
point is that before you can get that far, you *first* have to fix the
SyntaxError. That's a completely different problem.

You can't fix the \U syntax error by truncating the total file length. But
you can fix that syntax error by changing your code so it reads the ROOTDIR
from a config file instead of a hard-coded string literal -- exactly like
we told you to do!

An essential skill when programming is to read and understand the error
messages. One of the most painful things to use is a programming language
that just says 

"An error occurred"

with no other explanation. Python gives you lots of detail to explain what
went wrong:

SyntaxError means you made an error in the syntax of the code and the
program cannot even run.

FileNotFoundError means that the program did run, it tried to open a file,
but the file doesn't exist.

They're a little bit different, don't you agree?
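The difference can be reproduced in a few lines. This is only an
illustration: compile() stands in for Python reading a source file, and the
temporary directory and the name missing_subdir are made up for the example.

```python
import os
import tempfile

# The source text below contains single backslashes, like line 4 of the
# original script. Compiling it fails before any of it can run.
source = "ROOTDIR = 'C:\\Users\\zoran'"
try:
    compile(source, '<example>', 'exec')
    compile_error = None
except SyntaxError as exc:
    compile_error = exc

# A run-time failure: the code is valid, but the directory that should
# hold the new file does not exist.
base = tempfile.mkdtemp()
target = os.path.join(base, 'missing_subdir', 'file.txt')
try:
    with open(target, mode='w', encoding='utf-8') as f:
        f.write('test')
    runtime_error = None
except FileNotFoundError as exc:
    runtime_error = exc

print(type(compile_error).__name__, compile_error)
print(type(runtime_error).__name__, runtime_error)
```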

-- 
Steven


Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-09 Thread Dave Angel


On 05/09/2015 06:31 AM, zljubisic...@gmail.com wrote:


title = title[:232]
title = title.replace(" ", "_").replace("/", "_").replace("!", "_").replace("?", "_")\
    .replace('"', "_").replace(':', "_").replace(',', "_").replace('"', '')\
    .replace('\n', '_').replace("'", '')

print(title)

src_file = os.path.join(ROOTDIR, 'src_' + title + '.txt')
dst_file = os.path.join(ROOTDIR, 'des_' + title + '.txt')

print(len(src_file), src_file)
print(len(dst_file), dst_file)

with open(src_file, mode='w', encoding='utf-8') as s_file:
 s_file.write('test')


shutil.move(src_file, dst_file)

It works, but if you change title = title[:232] to title = title[:233], you will get 
"FileNotFoundError: [Errno 2] No such file or directory".
As you can see ROOTDIR contains \U.


No, we can't see what ROOTDIR is, since you read it from the config 
file.  And you don't show us the results of those prints.  You don't 
even show us the full exception, or even the line it fails on.


I doubt that the problem is in the ROOTDIR value, but of course nothing 
in your program bothers to check that that directory exists.  I expect 
you either have too many characters total, or the 232nd character is a 
strange one.  Or perhaps title has a backslash in it (you took care of 
forward slash).


While we're at it, if you do have an OS limitation on size, your code is 
truncating at the wrong point.  You need to truncate the title based on 
the total size of src_file and dst_file, and since the code cannot know 
the size of ROOTDIR, you need to include that in your figuring.
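A minimal sketch of truncating on the total path length instead of on the
title alone. The function name fit_title and the default limit of 259 are
assumptions for illustration; the limit is the figure quoted in this thread.

```python
import os

MAX_PATH_CHARS = 259  # assumption: the limit mentioned in this thread


def fit_title(rootdir, prefix, title, ext, limit=MAX_PATH_CHARS):
    """Trim title so that rootdir + separator + prefix + title + ext fits in limit."""
    # Length of everything except the title, including the path separator.
    overhead = len(os.path.join(rootdir, prefix + ext))
    room = max(limit - overhead, 0)
    return title[:room]


rootdir = 'C:\\Users\\zoran\\hrt'
title = fit_title(rootdir, 'src_', 'x' * 300, '.txt')
src_file = os.path.join(rootdir, 'src_' + title + '.txt')
print(len(title), len(src_file))
```

With the ROOTDIR from the config file, the room left for the title works out
to 232 characters, which agrees with title[:232] working and title[:233]
failing.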





--
DaveA

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-09 Thread zljubisicmob

Steven,

please do look at the code below:

# C:\Users\zoran\PycharmProjects\mm_align\hrt3.cfg contents
# [Dir]
# ROOTDIR = C:\Users\zoran\hrt


import os
import shutil
import configparser
import requests
import re

Config = configparser.ConfigParser()
Config.optionxform = str # preserve case in ini file
cfg_file = os.path.join('C:\\Users\\zoran\\PycharmProjects\\mm_align\\hrt3.cfg')
Config.read(cfg_file)



ROOTDIR = Config.get('Dir', 'ROOTDIR')

print(ROOTDIR)

html = requests.get("http://radio.hrt.hr/prvi-program/arhiva/ujutro-prvi-poligraf-politicki-grafikon/118/").text

art_html = re.search('(.+?)', html, re.DOTALL).group(1)
for p_tag in re.finditer(r'(.*?)', art_html, re.DOTALL):
    if '' not in p_tag.group(1):
        title = p_tag.group(1)

title = title[:232]
title = title.replace(" ", "_").replace("/", "_").replace("!", "_").replace("?", "_")\
    .replace('"', "_").replace(':', "_").replace(',', "_").replace('"', '')\
    .replace('\n', '_').replace("'", '')

print(title)

src_file = os.path.join(ROOTDIR, 'src_' + title + '.txt')
dst_file = os.path.join(ROOTDIR, 'des_' + title + '.txt')

print(len(src_file), src_file)
print(len(dst_file), dst_file)

with open(src_file, mode='w', encoding='utf-8') as s_file:
    s_file.write('test')


shutil.move(src_file, dst_file)

It works, but if you change title = title[:232] to title = title[:233], you 
will get "FileNotFoundError: [Errno 2] No such file or directory".
As you can see ROOTDIR contains \U.

Regards.

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-08 Thread Steven D'Aprano

On Sat, 9 May 2015 06:39 am, zljubisic...@gmail.com wrote:

> Thanks for clarifying.
> Looks like the error message was wrong.

No, the error message was right.

Your problem was that you used backslashes in *Python program code*, rather
than reading it from a text file.

In Python, a string-literal containing \U is an escape sequence which
expects exactly 8 hexadecimal digits to follow:

py> path = '\U000000a7'
py> print(path)
§

If you don't follow the \U with eight hex digits, you get an error:

py> path = '\Users~~~~'
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
position 4-6: truncated \U escape

This applies only to string literals in code. For data read from files,
backslash \ is just an ordinary character which has no special meaning.
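A small check of that distinction, as a sketch. The temporary file and the
name rootdir.txt are made up for the example.

```python
import os
import tempfile

raw = r'C:\Users\zoran'       # raw string literal: backslashes are taken as-is
doubled = 'C:\\Users\\zoran'  # ordinary literal: each backslash is escaped
print(raw == doubled, len(raw))

# Text read from a file is data, not source code, so no escape processing
# happens and the backslashes arrive unchanged.
path = os.path.join(tempfile.mkdtemp(), 'rootdir.txt')
with open(path, mode='w', encoding='utf-8') as f:
    f.write(raw)
with open(path, encoding='utf-8') as f:
    from_file = f.read()

print(from_file == raw)
```

Both literals spell the same 14-character string, and the string read back
from the file is identical to them.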

> On Windows NTFS I had a file name of more than 259 characters, which is the
> Windows limit. After cutting the file name to 259 characters everything works
> as it should. If I cut the file name to 260 characters I get the error from
> the subject, which is wrong.

What you describe is impossible. You cannot possibly get a SyntaxError at
compile time because the path is too long. You must have made other changes
at the same time, such as using a raw string r'C: ... \Users\ ...'.

-- 
Steven


Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-08 Thread Chris Angelico

On Sat, May 9, 2015 at 5:00 AM,   wrote:
> But it returns the following error:
>
>
> C:\Python34\python.exe C:/Users/bckslash_test.py
>   File "C:/Users/bckslash_test.py", line 4
> ROOTDIR = 'C:\Users'
>  ^
> SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in 
> position 2-3: truncated \U escape

Strong suggestion: Use forward slashes for everything other than what
you show to a human - and maybe even then (some programs have always
printed stuff out that way - zip/unzip, for instance). The backslash
has special meaning in many contexts, and you'll just save yourself so
much trouble...

ROOTDIR = 'C:/Users/zoran'

Problem solved!
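One way to keep forward slashes in the source and still get backslashes for
display is pathlib, sketched here. Using PureWindowsPath is a choice made for
this example so that it behaves the same on any OS; the thread itself uses
os.path.join.

```python
from pathlib import PureWindowsPath

rootdir = PureWindowsPath('C:/Users/zoran')
file1 = rootdir / 'abc.txt'
file2 = rootdir / 'def.txt'

print(file1)             # shown with backslashes
print(file1.as_posix())  # shown with forward slashes
```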

ChrisA

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-08 Thread zljubisicmob

Thanks for clarifying.
Looks like the error message was wrong.
On Windows NTFS I had a file name of more than 259 characters, which is the Windows 
limit.
After cutting the file name to 259 characters everything works as it should.
If I cut the file name to 260 characters I get the error from the subject, which is 
wrong.

Anyway case closed, thank you very much because I was suspecting that something 
is wrong with configparser.

Best regards.

Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-08 Thread MRAB


On 2015-05-08 20:00, zljubisic...@gmail.com wrote:

The script is very simple (abc.txt exists in ROOTDIR directory):

import os
import shutil

ROOTDIR = 'C:\Users\zoran'

file1 = os.path.join(ROOTDIR, 'abc.txt')
file2 = os.path.join(ROOTDIR, 'def.txt')

shutil.move(file1, file2)


But it returns the following error:


C:\Python34\python.exe C:/Users/bckslash_test.py
   File "C:/Users/bckslash_test.py", line 4
 ROOTDIR = 'C:\Users'
  ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in 
position 2-3: truncated \U escape

Process finished with exit code 1

As I saw, I could solve the problem by changing line 4 to (small letter "r" 
before the string):
ROOTDIR = r'C:\Users\zoran'

but that is not an option for me because I am using configparser in order to 
read the ROOTDIR from underlying cfg file.

I need a mechanism to read the path string with single backslashes into a 
variable, but afterwards to escape every backslash in it.

How to do that?


If you're reading the path from a file, it's not a problem. Try it!
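For anyone who wants to try it without creating a file first, here is a
sketch that feeds the config text to configparser directly. read_string
stands in for reading a cfg file with the same contents as the one shown
elsewhere in this thread.

```python
import configparser
import os

# In this literal the backslashes are doubled because it is Python source;
# the text handed to configparser contains single backslashes.
cfg_text = "[Dir]\nROOTDIR = C:\\Users\\zoran\\hrt\n"

config = configparser.ConfigParser()
config.optionxform = str  # preserve case of option names
config.read_string(cfg_text)

rootdir = config.get('Dir', 'ROOTDIR')
print(rootdir)
print(os.path.join(rootdir, 'abc.txt'))
```

The value comes back with its single backslashes intact, so nothing needs to
be escaped afterwards.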


Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-08 Thread random832

On Fri, May 8, 2015, at 15:00, zljubisic...@gmail.com wrote:
> As I saw, I could solve the problem by changing line 4 to (small letter
> "r" before the string):
> ROOTDIR = r'C:\Users\zoran'
> 
> but that is not an option for me because I am using configparser in order
> to read the ROOTDIR from underlying cfg file.

configparser won't have that problem, since "escaping" is only an issue
for python source code. No escaping for backslashes is necessary in
files read by configparser.

>>> import sys
>>> import configparser
>>> config = configparser.ConfigParser()
>>> config['DEFAULT'] = {'ROOTDIR': r'C:\Users\zoran'}
>>> config.write(sys.stdout)
[DEFAULT]
rootdir = C:\Users\zoran

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

2015-05-08 Thread zljubisicmob

The script is very simple (abc.txt exists in ROOTDIR directory):

import os
import shutil

ROOTDIR = 'C:\Users\zoran'

file1 = os.path.join(ROOTDIR, 'abc.txt')
file2 = os.path.join(ROOTDIR, 'def.txt')

shutil.move(file1, file2)


But it returns the following error:


C:\Python34\python.exe C:/Users/bckslash_test.py
  File "C:/Users/bckslash_test.py", line 4
ROOTDIR = 'C:\Users'
 ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in 
position 2-3: truncated \U escape

Process finished with exit code 1

As I saw, I could solve the problem by changing line 4 to (small letter "r" 
before the string):
ROOTDIR = r'C:\Users\zoran'

but that is not an option for me because I am using configparser in order to 
read the ROOTDIR from underlying cfg file.

I need a mechanism to read the path string with single backslashes into a 
variable, but afterwards to escape every backslash in it. 

How to do that?

Re: API for custom Unicode error handlers

2013-10-04 Thread Terry Reedy


On 10/4/2013 3:35 PM, Serhiy Storchaka wrote:

04.10.13 16:56, Steven D'Aprano wrote:

I have some custom Unicode error handlers, and I'm looking for advice on
the right API for dealing with them.



I'm planning to build this error handler in 3.4 (see
http://comments.gmane.org/gmane.comp.python.ideas/21296).



Should the module holding the error handlers automatically register them?


This question interests me too.


I did not respond on the p-i thread, but +1 for 'namereplace' also. Like 
others, I would prefer auto-register unless that creates a problem. If 
it is a problem, perhaps the registry mechanism needs improvement. On 
the other hand, if it is built in, it will be pre-registered.
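As a point of reference, a handler named 'namereplace' did become built in
(from Python 3.5, as far as I know), so on such versions no registration call
is needed. A quick check, assuming a Python that has it:

```python
import codecs

# lookup_error raises LookupError if no handler is registered under the name.
handler = codecs.lookup_error('namereplace')

encoded = 'abc\u04F1'.encode('ascii', 'namereplace')
print(encoded)
```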


--
Terry Jan Reedy



Re: API for custom Unicode error handlers

2013-10-04 Thread Serhiy Storchaka


04.10.13 16:56, Steven D'Aprano wrote:

I have some custom Unicode error handlers, and I'm looking for advice on
the right API for dealing with them.

I have a module containing custom Unicode error handlers. For example:

# Python 3
import unicodedata
def namereplace_errors(exc):
    c = exc.object[exc.start]
    try:
        name = unicodedata.name(c)
    except (KeyError, ValueError):
        n = ord(c)
        if n <= 0xFFFF:
            replace = "\\u%04x"
        else:
            assert n <= 0x10FFFF
            replace = "\\U%08x"
        replace = replace % n
    else:
        replace = "\\N{%s}" % name
    return replace, exc.start + 1


I'm planning to build this error handler in 3.4 (see 
http://comments.gmane.org/gmane.comp.python.ideas/21296).


Actually, the Python implementation should look like this:

def namereplace_errors(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    replace = []
    for c in exc.object[exc.start:exc.end]:
        try:
            replace.append(r'\N{%s}' % unicodedata.name(c))
        except ValueError:  # unicodedata.name() raises ValueError for unnamed characters
            n = ord(c)
            if n < 0x100:
                replace.append(r'\x%02x' % n)
            elif n < 0x10000:
                replace.append(r'\u%04x' % n)
            else:
                replace.append(r'\U%08x' % n)
    return ''.join(replace), exc.end


Now, my question:

Should the module holding the error handlers automatically register them?


This question interests me too.



Re: API for custom Unicode error handlers

2013-10-04 Thread Serhiy Storchaka


04.10.13 20:22, Chris Angelico wrote:

I'd be quite happy with importing having a side-effect here. If you
import a module that implements a numeric type, it should immediately
register itself with the Numeric ABC, right? This is IMO equivalent to
that.


There is a difference. You can't use a numeric type without importing a 
module, but you can use an error handler registered outside of your module.


This leads to subtle bugs. Suppose module A imports error_handlers and 
uses an error handler. Module B uses the error handler but doesn't import 
error_handlers. C.py imports A and then B, and all works. D.py imports B 
and then A, and fails.




Re: API for custom Unicode error handlers

2013-10-04 Thread Ethan Furman


On 10/04/2013 06:56 AM, Steven D'Aprano wrote:


Should the module holding the error handlers automatically register them?


I think it should.

Registration only needs to happen once, the module is useless without being registered, no threads nor processes are 
being started, and the only reason to import the module is to get the functionality... isn't it?


What about help(), sphynx (sp?), or other introspection tools?

This sounds similar to cgitb -- another module which you only import if you want the html'ized traceback, and yet it 
requires a separate cgitb.enable() call...


I change my mind, it shouldn't.

Throw in a .enable() function and call it good.  :)

--
~Ethan~

Re: API for custom Unicode error handlers

2013-10-04 Thread Chris Angelico

On Fri, Oct 4, 2013 at 11:56 PM, Steven D'Aprano wrote:
> Should the module holding the error handlers automatically register them?
> In other words, if I do:
>
> import error_handlers
>
> just importing it will have the side-effect of registering the error
> handlers. Normally, I dislike imports that have side-effects of this
> sort, but I'm not sure that the alternative is better, that is, to put
> responsibility on the caller to register some, or all, of the handlers:
>
> import error_handlers
> error_handlers.register(error_handlers.namereplace_errors)
> error_handlers.register_all()

Caveat: I don't actually use codecs much, so I don't know the specifics.

I'd be quite happy with importing having a side-effect here. If you
import a module that implements a numeric type, it should immediately
register itself with the Numeric ABC, right? This is IMO equivalent to
that.

> As far as I know, there is no way to find out what error handlers are
> registered, and no way to deregister one after it has been registered.

The only risk that I see is of an accidental collision. Having a codec
registered that you don't use can't hurt (afaik). Is there any
mechanism for detecting a name collision? If not, I wouldn't worry
about it.
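For checking a single name, codecs.lookup_error can serve as a collision
test, since it raises LookupError when nothing is registered under that name.
A sketch, with the helper register_if_free and the handler name
'demo_question_marks' invented for the example:

```python
import codecs


def register_if_free(name, handler):
    """Register handler under name unless that name is already taken."""
    try:
        codecs.lookup_error(name)
    except LookupError:
        codecs.register_error(name, handler)
        return True
    return False


def question_marks(exc):
    """Replace each unencodable character with a question mark."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return '?' * (exc.end - exc.start), exc.end


print(register_if_free('strict', question_marks))               # name already in use
print(register_if_free('demo_question_marks', question_marks))  # free name
print('caf\u00e9'.encode('ascii', 'demo_question_marks'))
```

This only answers "is this one name taken?". It does not list all registered
handlers or remove one.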

ChrisA

API for custom Unicode error handlers

2013-10-04 Thread Steven D'Aprano

I have some custom Unicode error handlers, and I'm looking for advice on 
the right API for dealing with them.

I have a module containing custom Unicode error handlers. For example:

# Python 3
import unicodedata
def namereplace_errors(exc):
    c = exc.object[exc.start]
    try:
        name = unicodedata.name(c)
    except (KeyError, ValueError):
        n = ord(c)
        if n <= 0xFFFF:
            replace = "\\u%04x"
        else:
            assert n <= 0x10FFFF
            replace = "\\U%08x"
        replace = replace % n
    else:
        replace = "\\N{%s}" % name
    return replace, exc.start + 1


Before I can use the error handler, I need to register it using this:


import codecs
codecs.register_error('namereplace', namereplace_errors)

And now:

py> 'abc\u04F1'.encode('ascii', 'namereplace')
b'abc\\N{CYRILLIC SMALL LETTER U WITH DIAERESIS}'


Now, my question:

Should the module holding the error handlers automatically register them? 
In other words, if I do:

import error_handlers

just importing it will have the side-effect of registering the error 
handlers. Normally, I dislike imports that have side-effects of this 
sort, but I'm not sure that the alternative is better, that is, to put 
responsibility on the caller to register some, or all, of the handlers:

import error_handlers
error_handlers.register(error_handlers.namereplace_errors)
error_handlers.register_all()


As far as I know, there is no way to find out what error handlers are 
registered, and no way to deregister one after it has been registered.

Which API would you prefer if you were using this module?


-- 
Steven

Re: Right solution to unicode error?

2012-11-09 Thread wxjmfauth

On Thursday, 8 November 2012 21:42:58 UTC+1, Ian wrote:
> On Thu, Nov 8, 2012 at 12:54 PM,   wrote:
> > Font has nothing to do here.
> > You are "simply" wrongly encoding your "unicode".
> >
> > >>> '\u2013'
> > '–'
> > >>> '\u2013'.encode('utf-8')
> > b'\xe2\x80\x93'
> > >>> '\u2013'.encode('utf-8').decode('cp1252')
> > 'â€“'
>
> No, it seriously is the font.  This is what I get using the default
> ("Raster") font:
>
> C:\>chcp 65001
> Active code page: 65001
>
> C:\>c:\python33\python
> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\u2013'
> 'â€“'
> >>> import sys
> >>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
> â€“
> 4
>
> I should note here that the characters copied and pasted do not
> correspond to the glyphs actually displayed in my terminal window.  In
> the terminal window I actually see:
>
> ΓÇô
>
> If I change the font to Lucida Console and run the *exact same code*,
> I get this:
>
> C:\>chcp 65001
> Active code page: 65001
>
> C:\>c:\python33\python
> Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
> 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\u2013'
> '–'
>
> >>> import sys
> >>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
> –
> 4
>
> Why is the font important?  I have no idea.  Blame Microsoft.

-

If you have something like this 'ΓÇô'; in
Unicode nomenclature:
>>> import unicodedata as ud
>>> for c in 'ΓÇô':
... ud.name(c)
... 
'GREEK CAPITAL LETTER GAMMA'
'LATIN CAPITAL LETTER C WITH CEDILLA'
'LATIN SMALL LETTER O WITH CIRCUMFLEX'

it is a sign of a "cp437" somewhere.

>>> '\u2013'.encode('utf-8').decode('cp437')
'ΓÇô'
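Both kinds of garbage seen in this thread can be reproduced from the same
three bytes, and the damage can be undone as long as the wrong code page is
known. A small sketch; the characters are written as escapes so that the
example does not depend on how this message itself is displayed.

```python
data = '\u2013'.encode('utf-8')   # EN DASH as UTF-8: b'\xe2\x80\x93'

as_cp437 = data.decode('cp437')   # the bytes read as code page 437
as_cp1252 = data.decode('cp1252') # the same bytes read as code page 1252

print(ascii(as_cp437))
print(ascii(as_cp1252))

# Reversing the wrong decoding gives back the original character.
repaired = as_cp437.encode('cp437').decode('utf-8')
print(ascii(repaired))
```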

On Windows 7. I do not remember ever having a "coding
of the characters" issue on XP.

jmf


Re: Right solution to unicode error?

2012-11-08 Thread Andrew Berg

On 2012.11.08 08:06, Oscar Benjamin wrote:
> It would be a lot better though if it just worked straight away
> without me needing to set the code page (like the terminal in every
> other OS I use).
The crude equivalent of .bashrc/.zshrc/whatever shell startup script for
cmd is setting a string value (REG_SZ) in
HKCU\Software\Microsoft\Command Processor named autorun and setting that
with whatever command(s) you want to run whenever the shell starts. Mine
has a value of '@chcp 65001>nul'. I actually run zsh when practical
(gotta love Cygwin) and I have an equivalent command in my .zshrc.
Getting Unicode to work in a Windows console is a hassle, but it /can/ work.
CPython does have a bug that makes it annoying at times, though -
http://bugs.python.org/issue1602
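When the code page cannot be changed, a script can at least avoid crashing by
encoding for whatever the console reports and replacing the rest. A minimal
sketch; the helper name encode_for_console is made up, and the fallback to
UTF-8 when no encoding is reported is an assumption.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode text for the console, replacing what the encoding cannot represent."""
    encoding = encoding or sys.stdout.encoding or 'utf-8'
    return text.encode(encoding, errors='replace')


print(encode_for_console('\u03b1', 'utf-8'))  # code page 65001
print(encode_for_console('\u03b1', 'cp437'))  # alpha exists in code page 437
print(encode_for_console('\u03b1', 'ascii'))  # no alpha here, so it is replaced
```

The resulting bytes would be written with sys.stdout.buffer.write(), as in
the examples quoted in this thread.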
-- 
CPython 3.3.0 | Windows NT 6.1.7601.17835

Re: Right solution to unicode error?

2012-11-08 Thread Oscar Benjamin

On 8 November 2012 19:54,   wrote:
> On Thursday, 8 November 2012 19:49:24 UTC+1, Ian wrote:
>> On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin wrote:
>> > If I want the other characters to work I need to change the code page:
>> >
>> > O:\>chcp 65001
>> > Active code page: 65001
>> >
>> > O:\>Q:\tools\Python33\python -c "import sys;
>>
>> I find that I also need to change the font.  With the default font,
>> printing '\u2013' gives me:
>>
>> â€“
>>
>> The only alternative font option I have in Windows XP is Lucida
>> Console, which at least works correctly, although it seems to be
>> lacking a lot of glyphs.
>
> Font has nothing to do here.
> You are "simply" wrongly encoding your "unicode".
>
> >>> '\u2013'
> '–'
> >>> '\u2013'.encode('utf-8')
> b'\xe2\x80\x93'
> >>> '\u2013'.encode('utf-8').decode('cp1252')
> 'â€“'

You have correctly identified that the displayed characters are the
result of accidentally interpreting utf-8 bytes as if they were cp1252
or similar. However, it is not Ian or Python that is confusing the
encoding. It is cmd.exe that is confusing the encoding in a
font-dependent way. I also had to change the font as Ian describes
though I did it some time ago and forgot to mention it here.

jmf, can you please trim the text you quote removing the parts you are
not responding to and then any remaining blank lines that were
inserted by your reader/editor?

Oscar

Re: Right solution to unicode error?

2012-11-08 Thread Ian Kelly

On Thu, Nov 8, 2012 at 1:54 PM, Prasad, Ramit  wrote:
> Why would font not matter? Unicode is the abstract definition
> of all characters right? From that we map the abstract
> character to a code page/set, which gives real values for an
> abstract character. From that code page we then visually display
> the "real value" based on the font. If that font does
> not have a glyph for a specific character (or has a different
> glyph) then that is a problem and not related to encoding.

Usually though when the font is missing a glyph for a Unicode
character, you just get a missing glyph symbol, such as an empty
rectangle.  For some reason when using the default font, cmd seemingly
ignores the active code page, skips decoding the characters, and tries
to print the individual bytes as if using code page 437.

RE: Right solution to unicode error?

2012-11-08 Thread Prasad, Ramit

wxjmfa...@gmail.com wrote:
> 
> On Thursday, 8 November 2012 19:49:24 UTC+1, Ian wrote:
> > On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin wrote:
> >
> > > If I want the other characters to work I need to change the code page:
> > >
> > > O:\>chcp 65001
> > > Active code page: 65001
> > >
> > > O:\>Q:\tools\Python33\python -c "import sys;
> > > sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> > > α
> > >
> > > O:\>Q:\tools\Python33\python -c "import sys;
> > > sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
> > > coding))"
> > > α
> >
> > I find that I also need to change the font.  With the default font,
> >
> > printing '\u2013' gives me:
> > â€“
> >
> > The only alternative font option I have in Windows XP is Lucida
> > Console, which at least works correctly, although it seems to be
> > lacking a lot of glyphs.
> 
> 
> 
> Font has nothing to do here.
> You are "simply" wrongly encoding your "unicode".
> 


Why would font not matter? Unicode is the abstract definition 
of all characters right? From that we map the abstract 
character to a code page/set, which gives real values for an
abstract character. From that code page we then visually display 
the "real value" based on the font. If that font does
not have a glyph for a specific character page (or a different
glyph) then that is a problem and not related encoding. 

Unicode->code page->font


> >>> '\u2013'
> '–'
> >>> '\u2013'.encode('utf-8')
> b'\xe2\x80\x93'
> >>> '\u2013'.encode('utf-8').decode('cp1252')
> 'â€“'
> 

This is a mismatched translation between code pages; not
font related but is instead one abstraction "level" up. 


This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.  
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-08 Thread Ian Kelly

On Thu, Nov 8, 2012 at 12:54 PM,   wrote:
> Font has nothing to do here.
> You are "simply" wrongly encoding your "unicode".
>
> >>> '\u2013'
> '–'
> >>> '\u2013'.encode('utf-8')
> b'\xe2\x80\x93'
> >>> '\u2013'.encode('utf-8').decode('cp1252')
> 'â€“'

No, it seriously is the font.  This is what I get using the default
("Raster") font:

C:\>chcp 65001
Active code page: 65001

C:\>c:\python33\python
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u2013'
'â€“'
>>> import sys
>>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
â€“
4

I should note here that the characters copied and pasted do not
correspond to the glyphs actually displayed in my terminal window.  In
the terminal window I actually see:

ΓÇô

If I change the font to Lucida Console and run the *exact same code*,
I get this:

C:\>chcp 65001
Active code page: 65001

C:\>c:\python33\python
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600
32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u2013'
'–'

>>> import sys
>>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
–
4

Why is the font important?  I have no idea.  Blame Microsoft.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-08 Thread wxjmfauth

Le jeudi 8 novembre 2012 19:49:24 UTC+1, Ian a écrit :
> On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin
> 
>  wrote:
> 
> > If I want the other characters to work I need to change the code page:
> 
> >
> 
> > O:\>chcp 65001
> 
> > Active code page: 65001
> 
> >
> 
> > O:\>Q:\tools\Python33\python -c "import sys;
> 
> > sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> 
> > α
> 
> >
> 
> > O:\>Q:\tools\Python33\python -c "import sys;
> 
> > sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
> 
> > coding))"
> 
> > α
> 
> 
> 
> I find that I also need to change the font.  With the default font,
> 
> printing '\u2013' gives me:
> 
> 
> 
> â€“
> 
> 
> 
> The only alternative font option I have in Windows XP is Lucida
> 
> Console, which at least works correctly, although it seems to be
> 
> lacking a lot of glyphs.



Font has nothing to do here.
You are "simply" wrongly encoding your "unicode".

>>> '\u2013'
'–'
>>> '\u2013'.encode('utf-8')
b'\xe2\x80\x93'
>>> '\u2013'.encode('utf-8').decode('cp1252')
'â€“'

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-08 Thread wxjmfauth

Le jeudi 8 novembre 2012 19:32:14 UTC+1, Oscar Benjamin a écrit :
> On 8 November 2012 15:05,   wrote:
> 
> > Le jeudi 8 novembre 2012 15:07:23 UTC+1, Oscar Benjamin a écrit :
> 
> >> On 8 November 2012 00:44, Oscar Benjamin  
> >> wrote:
> 
> >> > On 7 November 2012 23:51, Andrew Berg  wrote:
> 
> >> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
> 
> >>
> 
> >> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> 
> >> >>> support unicode
> 
> >>
> 
> >> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
> 
> >> >> the OP since Python versions below 3.3 don't support cp65001, but I
> 
> >> >> think it's important to point out that the Windows command line system
> 
> >> >> (it is not unique to cmd) does in fact support Unicode.
> 
> >>
> 
> >> > I have tried to use code page 65001 and it didn't work for me even if
> 
> >> > I did use a version of Python (possibly 3.3 alpha) that claimed to
> 
> >> > support it.
> 
> >>
> 
> >> I stand corrected. I've just checked and codepage 65001 does work in
> 
> >> cmd.exe (on this machine):
> 
> >>
> 
> >> O:\>chcp 65001
> 
> >> Active code page: 65001
> 
> >>
> 
> >> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
> 
> >> abc-def
> 
> >>
> 
> >> O:\>Q:\tools\Python33\python -c print('\u03b1')
> 
> >> α
> 
> >>
> 
> >> It would be a lot better though if it just worked straight away
> 
> >> without me needing to set the code page (like the terminal in every
> 
> >> other OS I use).
> 
> >
> 
> > It *WORKS* straight away. The problem is that
> 
> > people do not wish to use unicode correctly
> 
> > (eg. Mulder's example).
> 
> > Read the point 1) and 4) in my previous post.
> 
> >
> 
> > Unicode and in general the coding of the characters
> 
> > have nothing to do with the os's or programming languages.
> 
> 
> 
> I don't know what you mean that it works "straight away".
> 
> 
> 
> The default code page on my machine is cp850.
> 
> 
> 
> O:\>chcp
> 
> Active code page: 850
> 
> 
> 
> cp850 doesn't understand utf-8. It just prints garbage:
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> 
> ╬▒
> 
> 
> 
> Using the correct encoding doesn't help:
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode('cp850'))"
> 
> Traceback (most recent call last):
> 
>   File "", line 1, in 
> 
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
> 
> return codecs.charmap_encode(input,errors,encoding_map)
> 
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
> 
> position 0: character maps to
> 
>  
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
> 
> coding))"
> 
> Traceback (most recent call last):
> 
>   File "", line 1, in 
> 
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
> 
> return codecs.charmap_encode(input,errors,encoding_map)
> 
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
> 
> position 0: character maps to
> 
>  
> 
> 
> 
> If I want the other characters to work I need to change the code page:
> 
> 
> 
> O:\>chcp 65001
> 
> Active code page: 65001
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> 
> α
> 
> 
> 
> O:\>Q:\tools\Python33\python -c "import sys;
> 
> sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
> 
> coding))"
> 
> α
> 
> 
> 
> 
> 
> Oscar

You are confusing two things: the coding of the
characters, and the set of characters (glyphs/graphemes)
covered by a coding scheme.

It is always possible to encode a unicode string safely, but
the target coding may not contain the character.

Take a look at the output of this "special" interactive
interpreter, where the host coding (sys.stdout.encoding)
can be changed on the fly.


>>> s = 'éléphant\u2013abcéœ€'
>>> sys.stdout.encoding
''
>>> s
'éléphant–abcéœ€'
>>> 
>>> sys.stdout.encoding = 'cp1252'
>>> s.encode('cp1252')
'éléphant–abcéœ€'
>>> sys.stdout.encoding = 'cp850'
>>> s.encode('cp850')
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Python32\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013'
in position 8: character maps to <undefined>
>>> # but
>>> s.encode('cp850', 'replace')
'éléphant?abcé??'
>>> 
>>> sys.stdout.encoding = 'utf-8'
>>> s
'Ã©lÃ©phantâ€“abcÃ©Å“â‚¬'
>>> s.encode('utf-8')
'éléphant–abcéœ€'
>>> 
>>> sys.stdout.encoding = 'utf-16-le'  <
>>> s
' é l é p h a n t  a b c é S ¬ '
>>> s.encode('utf-16-le')
'éléphant–abcéœ€'

<<< some cheating here due to the mail system, it really looks like this.

jmf


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-08 Thread Ian Kelly

On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin
 wrote:
> If I want the other characters to work I need to change the code page:
>
> O:\>chcp 65001
> Active code page: 65001
>
> O:\>Q:\tools\Python33\python -c "import sys;
> sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> α
>
> O:\>Q:\tools\Python33\python -c "import sys;
> sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
> coding))"
> α

I find that I also need to change the font.  With the default font,
printing '\u2013' gives me:

â€“

The only alternative font option I have in Windows XP is Lucida
Console, which at least works correctly, although it seems to be
lacking a lot of glyphs.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-08 Thread Oscar Benjamin

On 8 November 2012 15:05,   wrote:
> Le jeudi 8 novembre 2012 15:07:23 UTC+1, Oscar Benjamin a écrit :
>> On 8 November 2012 00:44, Oscar Benjamin  wrote:
>> > On 7 November 2012 23:51, Andrew Berg  wrote:
>> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
>>
>> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>> >>> support unicode
>>
>> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
>> >> the OP since Python versions below 3.3 don't support cp65001, but I
>> >> think it's important to point out that the Windows command line system
>> >> (it is not unique to cmd) does in fact support Unicode.
>>
>> > I have tried to use code page 65001 and it didn't work for me even if
>> > I did use a version of Python (possibly 3.3 alpha) that claimed to
>> > support it.
>>
>> I stand corrected. I've just checked and codepage 65001 does work in
>> cmd.exe (on this machine):
>>
>> O:\>chcp 65001
>> Active code page: 65001
>>
>> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
>> abc-def
>>
>> O:\>Q:\tools\Python33\python -c print('\u03b1')
>> α
>>
>> It would be a lot better though if it just worked straight away
>> without me needing to set the code page (like the terminal in every
>> other OS I use).
>
> It *WORKS* straight away. The problem is that
> people do not wish to use unicode correctly
> (eg. Mulder's example).
> Read the point 1) and 4) in my previous post.
>
> Unicode and in general the coding of the characters
> have nothing to do with the os's or programming languages.

I don't know what you mean that it works "straight away".

The default code page on my machine is cp850.

O:\>chcp
Active code page: 850

cp850 doesn't understand utf-8. It just prints garbage:

O:\>Q:\tools\Python33\python -c "import sys;
sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
╬▒

Using the correct encoding doesn't help:

O:\>Q:\tools\Python33\python -c "import sys;
sys.stdout.buffer.write('\u03b1\n'.encode('cp850'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
position 0: character maps to <undefined>

O:\>Q:\tools\Python33\python -c "import sys;
sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
coding))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in
position 0: character maps to <undefined>

If I want the other characters to work I need to change the code page:

O:\>chcp 65001
Active code page: 65001

O:\>Q:\tools\Python33\python -c "import sys;
sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
α

O:\>Q:\tools\Python33\python -c "import sys;
sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en
coding))"
α


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-08 Thread wxjmfauth

Le jeudi 8 novembre 2012 15:07:23 UTC+1, Oscar Benjamin a écrit :
> On 8 November 2012 00:44, Oscar Benjamin  wrote:
> 
> > On 7 November 2012 23:51, Andrew Berg  wrote:
> 
> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
> 
> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> 
> >>> support unicode
> 
> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
> 
> >> the OP since Python versions below 3.3 don't support cp65001, but I
> 
> >> think it's important to point out that the Windows command line system
> 
> >> (it is not unique to cmd) does in fact support Unicode.
> 
> >
> 
> > I have tried to use code page 65001 and it didn't work for me even if
> 
> > I did use a version of Python (possibly 3.3 alpha) that claimed to
> 
> > support it.
> 
> 
> 
> I stand corrected. I've just checked and codepage 65001 does work in
> 
> cmd.exe (on this machine):
> 
> 
> 
> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
> 
> Traceback (most recent call last):
> 
>   File "", line 1, in 
> 
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 19, in encode
> 
> return codecs.charmap_encode(input,self.errors,encoding_map)[0]
> 
> UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in
> 
> position 3: character maps to
> 
>  
> 
> 
> 
> O:\>chcp 65001
> 
> Active code page: 65001
> 
> 
> 
> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
> 
> abc-def
> 
> 
> 
> 
> 
> O:\>Q:\tools\Python33\python -c print('\u03b1')
> 
> α
> 
> 
> 
> It would be a lot better though if it just worked straight away
> 
> without me needing to set the code page (like the terminal in every
> 
> other OS I use).
> 
> 
> 
> 
> 
> Oscar

--

It *WORKS* straight away. The problem is that
people do not wish to use unicode correctly
(eg. Mulder's example).
Read the point 1) and 4) in my previous post.

Unicode and in general the coding of the characters
have nothing to do with the os's or programming languages.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-08 Thread Oscar Benjamin

On 8 November 2012 00:44, Oscar Benjamin  wrote:
> On 7 November 2012 23:51, Andrew Berg  wrote:
>> On 2012.11.07 17:27, Oscar Benjamin wrote:
>>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>>> support unicode
>> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
>> the OP since Python versions below 3.3 don't support cp65001, but I
>> think it's important to point out that the Windows command line system
>> (it is not unique to cmd) does in fact support Unicode.
>
> I have tried to use code page 65001 and it didn't work for me even if
> I did use a version of Python (possibly 3.3 alpha) that claimed to
> support it.

I stand corrected. I've just checked and codepage 65001 does work in
cmd.exe (on this machine):

O:\>Q:\tools\Python33\python -c print('abc\u2013def')
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in
position 3: character maps to <undefined>

O:\>chcp 65001
Active code page: 65001

O:\>Q:\tools\Python33\python -c print('abc\u2013def')
abc-def


O:\>Q:\tools\Python33\python -c print('\u03b1')
α

It would be a lot better though if it just worked straight away
without me needing to set the code page (like the terminal in every
other OS I use).


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list

RE: Right solution to unicode error?

2012-11-08 Thread Anders Schneiderman

Thanks, Oscar and Ramit! This is exactly what I was looking for.

Anders 


> -----Original Message-----
> From: Oscar Benjamin [mailto:oscar.j.benja...@gmail.com]
> Sent: Wednesday, November 07, 2012 6:27 PM
> To: Anders Schneiderman
> Cc: python-list@python.org
> Subject: Re: Right solution to unicode error?
> 
> On 7 November 2012 22:17, Anders  wrote:
> >
> > Traceback (most recent call last):
> >   File "outlook_tasks.py", line 66, in <module>
> > my_tasks.dump_today_tasks()
> >   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> > dump_today_tasks
> > print task.subject
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> > position 42: ordinal not in range(128)
> >
> > Here's where I'm getting stuck.  In the code above I was just printing
> > the subject so I can see whether the script is working properly.
> > Ultimately what I want to do is parse the tasks I'm interested in and
> > then create an HTML file containing those tasks.  Given that, what's
> > the best way to fix this problem?
> 
> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> support unicode and Python is telling you that it cannot encode the string
> in a way that can be understood by your terminal. You can try using chcp
> to set the code page to something that works with your script.
> 
> If you are only printing it for debugging purposes you can just print the
> repr() of the string which will be ascii and will come out fine in your
> terminal. If you want to write it to a html file you should encode the
> string with whatever encoding (probably utf-8) you use in the html file.
> If you really just want your script to be able to print unicode characters
> then you need to use something other than cmd.exe (such as IDLE).
> 
> 
> Oscar

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-08 Thread Hans Mulder

On 8/11/12 00:53:49, Steven D'Aprano wrote:
> This error confuses me. Is that an exact copy and paste of the error, or 
> have you edited it or reconstructed it? Because it seems to me that if 
> task.subject is a unicode string, as it appears to be, calling print on 
> it should succeed:
> 
> py> s = u'ABC\u2013DEF'
> py> print s
> ABC–DEF

That would depend on whether python thinks sys.stdout can
handle UTF8.  For example, on my MacOS X box:

$ python2.6 -c 'print u"abc\u2013def"'
abc–def
$ python2.6 -c 'print u"abc\u2013def"' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
position 3: ordinal not in range(128)

This is because python knows that my terminal is capable
of handling UTF8, but it has no idea whether the program at
the other end of a pipe has that ability, so it falls back
to ASCII-only output when sys.stdout goes to a pipe.

Apparently the OP has a terminal that doesn't handle UTF8,
or one that Python doesn't know about.
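
The fallback described above is Python 2 behaviour. A rough Python 3 sketch of the same idea follows: ask the stream which encoding it declares, assume ASCII if it declares none, and escape whatever does not fit instead of raising. `encode_for_stream` is a hypothetical helper, not a standard-library function, and the in-memory streams merely stand in for a terminal and a pipe.

```python
import io
import sys

def encode_for_stream(text, stream=sys.stdout, errors="backslashreplace"):
    """Encode `text` for `stream`, assuming ASCII when no encoding is declared."""
    encoding = getattr(stream, "encoding", None) or "ascii"
    return text.encode(encoding, errors)

# Stand-ins for a pipe that only accepts ASCII and a UTF-8 terminal.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
utf8_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")

print(encode_for_stream("abc\u2013def", ascii_stream))
print(encode_for_stream("abc\u2013def", utf8_stream))
```

With the ASCII stream the en dash comes out as an escaped `\u2013` sequence; with the UTF-8 stream it is encoded as three bytes.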

Hope this helps,

-- HansM
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-08 Thread wxjmfauth

Le mercredi 7 novembre 2012 23:17:42 UTC+1, Anders a écrit :
> I've run into a Unicode error, and despite doing some googling, I
> 
> can't figure out the right way to fix it. I have a Python 2.6 script
> 
> that reads my Outlook 2010 task list. I'm able to read the tasks from
> 
> Outlook and store them as a list of objects without a hitch.  But when
> 
> I try to print the tasks' subjects, one of the tasks is generating an
> 
> error:
> 
> 
> 
> Traceback (most recent call last):
> 
>   File "outlook_tasks.py", line 66, in <module>
> 
> my_tasks.dump_today_tasks()
> 
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> 
> dump_today_tasks
> 
> print task.subject
> 
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> 
> position 42: ordinal not in range(128)
> 
> 
> 
> (where task.subject  was previously assigned the value of
> 
> task.Subject, aka the Subject property of an Outlook 2010 TaskItem)
> 
> 
> 
> From what I understand from reading online, the error is telling me
> 
> that the subject line  contains an en dash and that Python is trying
> 
> to convert to ascii and failing (as it should).
> 
> 
> 
> Here's where I'm getting stuck.  In the code above I was just printing
> 
> the subject so I can see whether the script is working properly.
> 
> Ultimately what I want to do is parse the tasks I'm interested in and
> 
> then create an HTML file containing those tasks.  Given that, what's
> 
> the best way to fix this problem?
> 
> 
> 
> BTW, if there's a clear description of the best solution for this
> 
> particular problem – i.e., where I want to ultimately display the
> 
> results as HTML – please feel free to refer me to the link. I tried
> 
> reading a number of docs on the web but still feel pretty lost.
> 
> 
> 
> Thanks,
> 
> Anders

--


The problem is not on the Python side or specific
to Python. It is on the side of the "coding of
characters".

1) Unicode is an abstract entity, it has to be encoded
for the system/device that will host it.
Using Python:
.encode(host_coding)

2) The host_coding scheme may not contain the
character (glyph/grapheme) corresponding to the
"unicode character". In that case there are two possible
solutions: "ignore" it or "replace" it with a
substitution character.
Using Python:
.encode(host_coding, "ignore")
.encode(host_coding, "replace")

3) Detecting the host_coding is the most difficult
task. Either you have to hard-code it or you
may expect Python to find it via sys.stdout.encoding.

4) Due to the nature of unicode, this is the only
way to do it correctly.

Expectedly failing and not failing examples.
Mainly Py3, but it doesn't matter. Note: Py3 encodes
and creates a byte string, which has to be
decoded to produce a native (unicode) string, here
with cp1252.


Py2

>>> u'éléphant\u2013abc'.encode('ascii')

Traceback (most recent call last):
  File "", line 1, in 
u'éléphant\u2013abc'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: 
ordinal not in range(128)
>>> print(u'éléphant\u2013abc'.encode('cp1252'))
éléphant–abc
>>> 

Py3

>>> 'éléphant\u2013abc'.encode('ascii')
Traceback (most recent call last):
  File "", line 1, in 
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in
position 0: ordinal not in range(128)
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore')
b'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace')
b'?l?phant?abc'
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore').decode('cp1252')
'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace').decode('cp1252')
'?l?phant?abc'
>>> 
>>> 'éléphant\u2013abc'.encode('cp1252').decode('cp1252')
'éléphant–abc'

>>> sys.stdout.encoding
'cp1252'
>>> 'éléphant\u2013abc'.encode(sys.stdout.encoding).decode('cp1252')
'éléphant–abc'

etc

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-07 Thread Oscar Benjamin

On 7 November 2012 23:51, Andrew Berg  wrote:
> On 2012.11.07 17:27, Oscar Benjamin wrote:
>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>> support unicode
> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
> the OP since Python versions below 3.3 don't support cp65001, but I
> think it's important to point out that the Windows command line system
> (it is not unique to cmd) does in fact support Unicode.

I have tried to use code page 65001 and it didn't work for me even if
I did use a version of Python (possibly 3.3 alpha) that claimed to
support it. It turned out that there were other Windows related
problems with using the codepage so that I had to do something like

chcp 65001 && python myscript.py && chcp 2521

(It was important for all those commands to be on the same line) I'm
not on Windows right now and I can't remember all the details but I
seem to remember that even with that awkwardness and changing the font
it still didn't actually work.

If you know how to make it work, I'd be interested to know.

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-07 Thread Steven D'Aprano

On Wed, 07 Nov 2012 14:17:42 -0800, Anders wrote:

> I've run into a Unicode error, and despite doing some googling, I can't
> figure out the right way to fix it. I have a Python 2.6 script that
> reads my Outlook 2010 task list. I'm able to read the tasks from Outlook
> and store them as a list of objects without a hitch.  But when I try to
> print the tasks' subjects, one of the tasks is generating an error:
> 
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in <module>
> my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> dump_today_tasks
> print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)


This error confuses me. Is that an exact copy and paste of the error, or 
have you edited it or reconstructed it? Because it seems to me that if 
task.subject is a unicode string, as it appears to be, calling print on 
it should succeed:

py> s = u'ABC\u2013DEF'
py> print s
ABC–DEF

What does type(task.subject) return?


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-07 Thread Andrew Berg

On 2012.11.07 17:27, Oscar Benjamin wrote:
> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> support unicode
Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
the OP since Python versions below 3.3 don't support cp65001, but I
think it's important to point out that the Windows command line system
(it is not unique to cmd) does in fact support Unicode.
-- 
CPython 3.3.0 | Windows NT 6.1.7601.17835
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Right solution to unicode error?

2012-11-07 Thread Oscar Benjamin

On 7 November 2012 22:17, Anders  wrote:
>
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in <module>
> my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> dump_today_tasks
> print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)
>
> Here's where I'm getting stuck.  In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks.  Given that, what's
> the best way to fix this problem?

Are you using cmd.exe (standard Windows terminal)? If so, it does not
support unicode and Python is telling you that it cannot encode the
string in a way that can be understood by your terminal. You can try
using chcp to set the code page to something that works with your
script.

If you are only printing it for debugging purposes you can just print
the repr() of the string which will be ascii and will come out fine in
your terminal. If you want to write it to a html file you should
encode the string with whatever encoding (probably utf-8) you use in
the html file. If you really just want your script to be able to print
unicode characters then you need to use something other than cmd.exe
(such as IDLE).
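
A minimal sketch of both suggestions, written for Python 3 (the original script is Python 2.6, where `repr()` and `codecs.open` would play the same roles). The function name, the sample subjects and the file name are made up for illustration.

```python
import html
import os
import tempfile

def write_tasks_html(subjects, path):
    """Write the subjects as an HTML list, stored on disk as UTF-8."""
    items = "\n".join("<li>{}</li>".format(html.escape(s)) for s in subjects)
    page = (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        "<title>Tasks</title></head>\n<body><ul>\n"
        + items
        + "\n</ul></body></html>\n"
    )
    # The encoding is stated explicitly, so the console code page is irrelevant.
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)

subjects = ["Budget 2012\u20132013", "R&D review"]  # hypothetical task subjects

# Debug output: ascii() escapes every non-ASCII character, so any console copes.
for s in subjects:
    print(ascii(s))

path = os.path.join(tempfile.mkdtemp(), "tasks.html")
write_tasks_html(subjects, path)
```

The `meta charset` declaration should name the same encoding that is passed to `open`, otherwise a browser may misread the bytes in the same way a console does.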

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list

RE: Right solution to unicode error?

2012-11-07 Thread Prasad, Ramit

Anders wrote:
> 
> I've run into a Unicode error, and despite doing some googling, I
> can't figure out the right way to fix it. I have a Python 2.6 script
> that reads my Outlook 2010 task list. I'm able to read the tasks from
> Outlook and store them as a list of objects without a hitch.  But when
> I try to print the tasks' subjects, one of the tasks is generating an
> error:
> 
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in <module>
> my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in
> dump_today_tasks
> print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)
> 
> (where task.subject  was previously assigned the value of
> task.Subject, aka the Subject property of an Outlook 2010 TaskItem)
> 
> From what I understand from reading online, the error is telling me
> that the subject line  contains an en dash and that Python is trying
> to convert to ascii and failing (as it should).
> 
> Here's where I'm getting stuck.  In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks.  Given that, what's
> the best way to fix this problem?
> 
> BTW, if there's a clear description of the best solution for this
> particular problem - i.e., where I want to ultimately display the
> results as HTML - please feel free to refer me to the link. I tried
> reading a number of docs on the web but still feel pretty lost.
> 

You can always encode in a non-ASCII codec.
`print task.subject.encode(<encoding>)` where <encoding> is something that
supports the characters you want e.g. latin1.

The list of built in codecs can be found:
http://docs.python.org/library/codecs.html#standard-encodings
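
Whether a given codec "supports the characters you want" is easy to check before committing to it. The sketch below uses Python 3; `can_encode` is a hypothetical helper and the subject string is invented. It tries a few codecs against a string containing the en dash (U+2013) from the traceback.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

subject = "Q3 report \u2013 draft"   # invented example containing an en dash

for encoding in ("ascii", "latin-1", "cp850", "cp1252", "utf-8"):
    print(encoding, can_encode(subject, encoding))
```

Note that latin-1 has no en dash, so for this particular character cp1252 or utf-8 would be needed.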


~Ramit



-- 
http://mail.python.org/mailman/listinfo/python-list

Right solution to unicode error?

2012-11-07 Thread Anders

I've run into a Unicode error, and despite doing some googling, I
can't figure out the right way to fix it. I have a Python 2.6 script
that reads my Outlook 2010 task list. I'm able to read the tasks from
Outlook and store them as a list of objects without a hitch.  But when
I try to print the tasks' subjects, one of the tasks is generating an
error:

Traceback (most recent call last):
  File "outlook_tasks.py", line 66, in <module>
    my_tasks.dump_today_tasks()
  File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
    print task.subject
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
position 42: ordinal not in range(128)

(where task.subject  was previously assigned the value of
task.Subject, aka the Subject property of an Outlook 2010 TaskItem)

From what I understand from reading online, the error is telling me
that the subject line  contains an en dash and that Python is trying
to convert to ascii and failing (as it should).

Here's where I'm getting stuck.  In the code above I was just printing
the subject so I can see whether the script is working properly.
Ultimately what I want to do is parse the tasks I'm interested in and
then create an HTML file containing those tasks.  Given that, what's
the best way to fix this problem?

BTW, if there's a clear description of the best solution for this
particular problem – i.e., where I want to ultimately display the
results as HTML – please feel free to refer me to the link. I tried
reading a number of docs on the web but still feel pretty lost.

Thanks,
Anders
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Why are some unicode error handlers "encode only"?

2012-03-11 Thread Terry Reedy

On 3/11/2012 10:37 AM, Steven D'Aprano wrote:

> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this?

I presume the purpose of both is to facilitate transmission of unicode 
text via byte transmission by extending incomplete byte encodings by 
replacing unicode chars that do not fit in the given encoding by a ascii 
byte sequence that will fit.

> I don't see why they shouldn't work for decoding as well.
> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?
>
> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> =>  r'aaa--騷--\xe9\x21--bbb'

This output does not round-trip and would be a bit of a fib since it 
somewhat misrepresents what the encoded bytes were:

>>> r'aaa--騷--\xe9\x21--bbb'.encode("cp932")
b'aaa--\xe9z--\\xe9\\x21--bbb'
>>> b'aaa--\xe9z--\\xe9\\x21--bbb'.decode("cp932")
'aaa--騷--\\xe9\\x21--bbb'

Python 3 added surrogateescape error handling to solve this problem.
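
As a small sketch of what surrogateescape does (using UTF-8 instead of cp932 to keep the behaviour easy to check, and a made-up byte string), undecodable bytes are smuggled through as lone surrogates and come back unchanged on encoding:

```python
# Bytes that are not valid UTF-8: 0xE9 is a lead byte with no continuation.
raw = b"aaa--\xe9!--bbb"

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of raising.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(restored == raw)
```

Unlike the hypothetical backslashreplace output above, this round-trips, because the escaped form cannot be confused with ordinary text.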

> and similarly for xmlcharrefreplace.

Since xml character references are representations of unicode chars, and 
not bytes, I do not see how that would work. By analogy, perhaps you 
mean to have '&#e9;' in your output instead of '\xe9\x21', but 
those would not properly be xml numeric character references.

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list

Re: Why are some unicode error handlers "encode only"?

2012-03-11 Thread Walter Dörwald


On 11.03.12 15:37, Steven D'Aprano wrote:


> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this? I don't see why they shouldn't work for decoding as well.


Because xmlcharrefreplace and backslashreplace are *error* handlers. 
However the byte sequence b'&#12345;' does *not* contain any bytes that 
are not decodable for e.g. the ASCII codec. So there are no errors to 
handle.



> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?


The byte sequence b'\xe9!' however is not something that would have been 
produced by the backslashreplace error handler. b'\\xe9!' (a sequence 
containing 5 bytes) would have been (and this probably would decode 
without any problems with the cp932 codec).
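
A quick check of the point above, as a sketch in Python 3 (the input string is made up): what backslashreplace produces when encoding is plain ASCII text, five bytes long here, and decoding it raises no error for a handler to act on.

```python
# What the backslashreplace handler actually produces when *encoding*.
escaped = "\xe9!".encode("ascii", "backslashreplace")

print(escaped)        # the escape is spelled out in ASCII characters
print(len(escaped))   # backslash, 'x', 'e', '9', '!'

# Those ASCII bytes decode under cp932 with no error to handle,
# so the escape stays in the text as literal characters.
decoded = escaped.decode("cp932")
print(decoded)
```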



> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> =>  r'aaa--騷--\xe9\x21--bbb'
>
> and similarly for xmlcharrefreplace.


This would require a postprocess step *after* the bytes have been 
decoded. This is IMHO out of scope for Python's codec machinery.


Servus,
   Walter

--
http://mail.python.org/mailman/listinfo/python-list

Why are some unicode error handlers "encode only"?

2012-03-11 Thread Steven D'Aprano

At least two standard error handlers are documented as working for 
encoding only:

xmlcharrefreplace
backslashreplace

See http://docs.python.org/library/codecs.html#codec-base-classes

and http://docs.python.org/py3k/library/codecs.html

Why is this? I don't see why they shouldn't work for decoding as well. 
Consider this example using Python 3.2:

>>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10: 
illegal multibyte sequence

The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932 (also 
known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't 
or can't be supported?

# This doesn't actually work.
b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
=> r'aaa--騷--\xe9\x21--bbb'

and similarly for xmlcharrefreplace.
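
For readers on later versions: this thread concerns Python 3.2, but since Python 3.5 the backslashreplace handler is accepted for decoding as well. A minimal sketch, using the ascii codec and a made-up byte string rather than the cp932 example so the result is easy to verify:

```python
raw = b"caf\xe9"

# On Python 3.5 and later, backslashreplace also works when decoding:
# the undecodable byte is replaced by its escaped spelling.
text = raw.decode("ascii", "backslashreplace")

print(text)
```

As noted elsewhere in the thread, such output does not round-trip back to the original bytes; it is mainly useful for display and logging.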



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error in sax parser

2011-02-09 Thread Rickard Lindberg

On Tue, Feb 8, 2011 at 5:41 PM, Chris Rebert  wrote:
> On Tue, Feb 8, 2011 at 7:57 AM, Rickard Lindberg  wrote:
>> Hi,
>>
>> Here is a bash script to reproduce my error:
>
> Including the error message and traceback is still helpful, for future
> reference.
>
>>    #!/bin/sh
>>
>>    cat > å.timeline <<EOF
>>    EOF
>>
>>    python <<EOF
>>    # encoding: utf-8
>>    from xml.sax import parse
>>    from xml.sax.handler import ContentHandler
>>    parse(u"å.timeline", ContentHandler())
>>    EOF
>>
>> If I instead do
>>
>>    parse(u"å.timeline".encode("utf-8"), ContentHandler())
>>
>> the script runs without errors.
>>
>> Is this a bug or expected behavior?
>
> Bug; open() figures out the filesystem encoding just fine.
> Bug tracker to report the issue to: http://bugs.python.org/
>
> Workaround:
> parse(open(u"å.timeline", 'r'), ContentHandler())
>
> Cheers,
> Chris

Bug reported at http://bugs.python.org/issue11159

-- 
Rickard Lindberg
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error in sax parser

2011-02-09 Thread Stefan Behnel


Rickard Lindberg, 09.02.2011 14:01:

Did you read my reply?


Sorry, it was me who failed to read your question properly.

Unicode file names aren't really working well, especially not in Py2.x.
Python 3.2 provides many improvements here.

I assume your file system encoding is UTF-8? What does
sys.getfilesystemencoding() give you?


My getfilesystemencoding() returns utf-8.


Ok, same here. I tried it with Python 3.1.2 and it works for me.

So I think the right work-around for you in Python 2 is to encode the file 
name using whatever "sys.getfilesystemencoding()" returns.
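
A minimal sketch of that work-around (shown in Python 3 syntax so it can be run today; in Python 2 the same `.encode()` call would be applied to the `u"..."` file name before handing it to `parse()`). The file name is the one from the thread; nothing here touches the disk:

```python
import sys

# File name with a non-ASCII character, as in the thread.
name = u"\xe5.timeline"

# Ask Python which encoding it uses for file names on this system.
fs_encoding = sys.getfilesystemencoding()

# Encode explicitly instead of relying on an implicit ASCII conversion.
encoded_name = name.encode(fs_encoding)

print(fs_encoding)
print(repr(encoded_name))
```

This assumes the file system encoding can represent the name at all; with an ASCII-only encoding the explicit encode would raise as well.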


And I agree with Chris Rebert that you should open a bug against the sax 
package in Python 2.7 on the bug tracker.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error in sax parser

2011-02-09 Thread Rickard Lindberg

>> Did you read my reply?
>
>Sorry, it was me who failed to read your question properly.
>
>Unicode file names aren't really working well, especially not in Py2.x.
>Python 3.2 provides many improvements here.
>
>I assume your file system encoding is UTF-8? What does
>sys.getfilesystemencoding() give you?
>
>Stefan

Since I'm not registered on the Python mailing list I had some trouble
replying to your message.

My getfilesystemencoding() returns utf-8.

--
Rickard Lindberg
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error in sax parser

2011-02-09 Thread Stefan Behnel


Stefan Behnel, 09.02.2011 09:58:

Rickard Lindberg, 09.02.2011 09:32:

On Tue, Feb 8, 2011 at 5:41 PM, Chris Rebert wrote:

Here is a bash script to reproduce my error:


Including the error message and traceback is still helpful, for future
reference.


Thanks for pointing it out.


#!/bin/sh

cat> å.timeline<


EOF

python<

Bug; open() figures out the filesystem encoding just fine.
Bug tracker to report the issue to: http://bugs.python.org/

Workaround:
parse(open(u"å.timeline", 'r'), ContentHandler())


When I tried your workaround, I still got this error:

Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py",
line 31, in parse
parser.parse(filename_or_stream)
File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py",
line 109, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py",
line 119, in parse
self.prepareParser(source)
File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py",
line 121, in prepareParser
self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in
position 0: ordinal not in range(128)

The open(..) part works fine, but there still seems to be a problem
inside the
sax parser.


Did you read my reply?


Sorry, it was me who failed to read your question properly.

Unicode file names aren't really working well, especially not in Py2.x. 
Python 3.2 provides many improvements here.


I assume your file system encoding is UTF-8? What does 
sys.getfilesystemencoding() give you?


Stefan

--
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error in sax parser

2011-02-09 Thread Stefan Behnel


Rickard Lindberg, 09.02.2011 09:32:

On Tue, Feb 8, 2011 at 5:41 PM, Chris Rebert  wrote:

Here is a bash script to reproduce my error:


Including the error message and traceback is still helpful, for future
reference.


Thanks for pointing it out.


#!/bin/sh

cat>  å.timeline<


EOF

python<

Bug; open() figures out the filesystem encoding just fine.
Bug tracker to report the issue to: http://bugs.python.org/

Workaround:
parse(open(u"å.timeline", 'r'), ContentHandler())


When I tried your workaround, I still got this error:

Traceback (most recent call last):
   File "<stdin>", line 4, in <module>
   File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py",
line 31, in parse
 parser.parse(filename_or_stream)
   File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py",
line 109, in parse
 xmlreader.IncrementalParser.parse(self, source)
   File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py",
line 119, in parse
 self.prepareParser(source)
   File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py",
line 121, in prepareParser
 self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in
position 0: ordinal not in range(128)

The open(..) part works fine, but there still seems to be a problem inside the
sax parser.


Did you read my reply?

Stefan

--
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error in sax parser

2011-02-09 Thread Rickard Lindberg

On Tue, Feb 8, 2011 at 5:41 PM, Chris Rebert  wrote:
>> Here is a bash script to reproduce my error:
>
> Including the error message and traceback is still helpful, for future
> reference.

Thanks for pointing it out.

>>    #!/bin/sh
>>
>>    cat > å.timeline <<EOF
>>    EOF
>>
>>    python <<EOF
>>    # encoding: utf-8
>>    from xml.sax import parse
>>    from xml.sax.handler import ContentHandler
>>    parse(u"å.timeline", ContentHandler())
>>    EOF
>>
>> If I instead do
>>
>>    parse(u"å.timeline".encode("utf-8"), ContentHandler())
>>
>> the script runs without errors.
>>
>> Is this a bug or expected behavior?
>
> Bug; open() figures out the filesystem encoding just fine.
> Bug tracker to report the issue to: http://bugs.python.org/
>
> Workaround:
> parse(open(u"å.timeline", 'r'), ContentHandler())

When I tried your workaround, I still got this error:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py", line 31, in parse
    parser.parse(filename_or_stream)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py", line 119, in parse
    self.prepareParser(source)
  File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 121, in prepareParser
    self._parser.SetBase(source.getSystemId())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in
position 0: ordinal not in range(128)

The open(..) part works fine, but there still seems to be a problem inside the
sax parser.

-- 
Rickard Lindberg
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error in sax parser

2011-02-08 Thread Stefan Behnel


Rickard Lindberg, 08.02.2011 16:57:

Hi,

Here is a bash script to reproduce my error:

 #!/bin/sh

 cat>  å.timeline<
 
   0.13.0devb38ace0a572b+
   
   
   
 
   2011-02-01 00:00:00
   2011-02-03 08:46:00
   asdsd
 
   
   
 
   2011-01-24 16:38:11
   2011-02-23 16:38:11
 
 
 
   
 
 EOF

 python<

Expected behaviour. You cannot parse XML from unicode strings, especially 
not when the XML data explicitly declares itself as being encoded in UTF-8.


Parse from a byte string instead, as you do in your fixed code.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error in sax parser

2011-02-08 Thread Chris Rebert

On Tue, Feb 8, 2011 at 7:57 AM, Rickard Lindberg  wrote:
> Hi,
>
> Here is a bash script to reproduce my error:

Including the error message and traceback is still helpful, for future
reference.

>    #!/bin/sh
>
>    cat > å.timeline <<EOF
>    EOF
>
>    python <<EOF
>    # encoding: utf-8
>    from xml.sax import parse
>    from xml.sax.handler import ContentHandler
>    parse(u"å.timeline", ContentHandler())
>    EOF
>
> If I instead do
>
>    parse(u"å.timeline".encode("utf-8"), ContentHandler())
>
> the script runs without errors.
>
> Is this a bug or expected behavior?

Bug; open() figures out the filesystem encoding just fine.
Bug tracker to report the issue to: http://bugs.python.org/

Workaround:
parse(open(u"å.timeline", 'r'), ContentHandler())
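
The idea behind the workaround, sketched in Python 3 as a self-contained script. The XML content, the element names and the `CountingHandler` class are made up for illustration. Note that a follow-up in this thread reports the workaround still failing inside the `_xmlplus` SAX code on one Python 2.7 installation, so treat this as a sketch of the approach, not a guaranteed fix there:

```python
import os
import tempfile
from xml.sax import parse
from xml.sax.handler import ContentHandler


class CountingHandler(ContentHandler):
    """Hypothetical handler that counts start tags."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


with tempfile.TemporaryDirectory() as tmp:
    # Non-ASCII file name, as in the thread.
    path = os.path.join(tmp, u"\xe5.timeline")
    with open(path, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="utf-8"?>'
                b"<timeline><version>1</version></timeline>")

    handler = CountingHandler()
    # Pass an open binary file object rather than the file name, so
    # open() deals with the file system encoding, not the parser.
    with open(path, "rb") as f:
        parse(f, handler)

print(handler.count)
```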

Cheers,
Chris
-- 
http://mail.python.org/mailman/listinfo/python-list

Unicode error in sax parser

2011-02-08 Thread Rickard Lindberg

Hi,

Here is a bash script to reproduce my error:

#!/bin/sh

cat > å.timeline <<EOF

  0.13.0devb38ace0a572b+
  
  
  

  2011-02-01 00:00:00
  2011-02-03 08:46:00
  asdsd

  
  

  2011-01-24 16:38:11
  2011-02-23 16:38:11



  

EOF

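python <<EOF
# encoding: utf-8
from xml.sax import parse
from xml.sax.handler import ContentHandler
parse(u"å.timeline", ContentHandler())
EOF

If I instead do

parse(u"å.timeline".encode("utf-8"), ContentHandler())

the script runs without errors.

Is this a bug or expected behavior?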

Re: Unicode error

2010-08-07 Thread kj

In <4c5d4ad9$0$28666$c3e8...@news.astraweb.com> Steven D'Aprano writes:

>On Sat, 07 Aug 2010 19:28:56 +1200, Gregory Ewing wrote:

>> Steven D'Aprano wrote:
>>> "No memory?  No disk space?  No problem! Just a flesh wound!"  What's
>>> the point of that?
>> 
>> +1 QOTW

>While I'm always happy to be nominated for QOTW, in this case I didn't 
>say it, and the nomination should go to KJ.


(The ol' "insert Monty Python reference" move: it never fails...) 
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-08-07 Thread Gregory Ewing


Steven D'Aprano wrote:

"No memory?  No disk space?  No problem! Just a flesh
wound!"  What's the point of that?


+1 QOTW
--
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-08-06 Thread Steven D'Aprano

On Fri, 06 Aug 2010 11:23:50 +, kj wrote:

> I don't get your point.  Even when I *know* that a certain exception may
> happen, I don't necessarily catch it.  I catch only those exceptions for
> which I can think of a suitable response that is *different* from just
> letting the program fail.  (After all, my own code raises its own
> exceptions with the precise intention of making the program fail.)  If
> an unexpected exception occurs, then by definition, I had no better
> response in mind for that situation than just letting the program fail,
> so I'm happy to let that happen. If, afterwards, I think of a different
> response for a previously uncaught exception, I'll modify the code
> accordingly.
> 
> I find this approach far preferable to the alternative of knowing a long
> list of possible exceptions (some of which may never happen in actual
> practice), and think of ways to keep the program still alive
> no-matter-what.  "No memory?  No disk space?  No problem! Just a flesh
> wound!"  What's the point of that?

/me cheers wildly!

Well said!



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-08-06 Thread kj

Nobody writes:

>On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote:

>> Don't write bare excepts, always catch the error you want and nothing 
>> else.

>That advice would make more sense if it was possible to know which
>exceptions could be raised. In practice, that isn't possible, as the
>documentation seldom provides this information. Even for the built-in
>classes, the documentation is weak in this regard; for less important
>modules and third-party libraries, it's entirely absent.

I don't get your point.  Even when I *know* that a certain exception
may happen, I don't necessarily catch it.  I catch only those
exceptions for which I can think of a suitable response that is
*different* from just letting the program fail.  (After all, my
own code raises its own exceptions with the precise intention of
making the program fail.)  If an unexpected exception occurs, then
by definition, I had no better response in mind for that situation
than just letting the program fail, so I'm happy to let that happen.
If, afterwards, I think of a different response for a previously
uncaught exception, I'll modify the code accordingly.

I find this approach far preferable to the alternative of knowing
a long list of possible exceptions (some of which may never happen
in actual practice), and think of ways to keep the program still
alive no-matter-what.  "No memory?  No disk space?  No problem!
Just a flesh wound!"  What's the point of that?

(If I want the final error message to be something other than a
bare stack trace, I may wrap the whole execution in a global/top-level
try/catch block so that I can fashion a suitable error message
right before calling exit, but that's just "softening the fall":
the program still will go down.)
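
A minimal sketch of that "softening the fall" wrapper, in Python 3. The names `run` and `broken_main` and the error text are hypothetical; the point is only that the program still ends with a failure status, just with a friendlier message than a raw stack trace:

```python
import sys


def run(main):
    """Run main(); on an unexpected error, print a short message and return 1."""
    try:
        main()
    except Exception as exc:
        # The program still goes down; only the final message is nicer.
        sys.stderr.write("fatal: %s: %s\n" % (type(exc).__name__, exc))
        return 1
    return 0


def broken_main():
    raise RuntimeError("something went wrong")  # made-up failure


exit_code = run(broken_main)
print(exit_code)
# A real script would finish with: sys.exit(exit_code)
```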

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-08-04 Thread Aahz

Nobody wrote:
>
>Java's checked exception mechanism was based on real-world experience of
>the pitfalls of abstract types. And that experience was gained in
>environments where interface specifications were far more detailed than is
>the norm in the Python world.

There are a number of people who claim that checked exceptions are the
wrong answer:

http://www.mindview.net/Etc/Discussions/CheckedExceptions
-- 
Aahz (a...@pythoncraft.com)   <*> http://www.pythoncraft.com/

"Normal is what cuts off your sixth finger and your tail..."  --Siobhan
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-07-25 Thread Nobody

On Sun, 25 Jul 2010 14:47:11 +, Steven D'Aprano wrote:

>>> But in the
>>> meanwhile, once you get an error, you know what it is. You can
>>> intentionally feed code bad data and see what you get. And then maybe
>>> add a test to make sure your code traps such errors.
>> 
>> That doesn't really help with exceptions which are triggered by external
>> factors rather than explicit inputs.
> 
> Huh? What do you mean by "external factors"?

I mean this:

> If you mean external factors like "the network goes down" or "the disk is 
> full",

> you can still test for those with appropriate test doubles (think 
> "stunt doubles", only for testing) such as stubs or mocks. It's a little 
> bit more work (sometimes a lot more work), but it can be done.

I'd say "a lot" is more often the case.

>> Also, if you're writing libraries (rather than self-contained programs),
>> you have no control over the arguments. 
> 
> You can't control what the caller passes to you, but once you have it, 
> you have total control over it.

Total control insofar as you can wrap all method calls in semi-bare
excepts (i.e. catch any Exception but not Interrupt).

>> Coupled with the fact that duck
>> typing is quite widely advocated in Python circles, you're stuck with
>> the possibility that any method call on any argument can raise any
>> exception. This is even true for calls to standard library functions or
>> methods of standard classes if you're passing caller-supplied objects as
>> arguments.
> 
> That's a gross exaggeration. It's true that some methods could in theory 
> raise any exception, but in practice most exceptions are vanishingly 
> rare.

Now *that* is a gross exaggeration. Exceptions are by their nature
exceptional, in some sense of the word. But a substantial part of Python
development is playing whac-a-mole with exceptions. Write code, run
code, get traceback, either fix the cause (LBYL) or handle the exception
(EAFP), wash, rinse, repeat.

> And it isn't even remotely correct that "any" method could raise 
> anything. If you can get something other than NameError, ValueError or 
> TypeError by calling "spam".index(arg), I'd like to see it.

How common is it to call methods on a string literal in real-world code?

It's far, far more common to call methods on an argument or expression
whose value could be any "string-like object" (e.g. UserString or a str
subclass).

IOW, it's "almost" correct that any method can raise any exception. The
fact that the number of counter-examples is non-zero doesn't really
change this. Even an isinstance() check won't help, as nothing prohibits a
subclass from raising exceptions which the original doesn't. Even using
"type(x) == sometype" doesn't help if x's methods involve calling methods
of user-supplied values (unless those methods are wrapped in catch-all
excepts).

Java's checked exception mechanism was based on real-world experience of
the pitfalls of abstract types. And that experience was gained in
environments where interface specifications were far more detailed than is
the norm in the Python world.

> Frankly, it sounds to me that you're over-analysing all the things that 
> "could" go wrong rather than focusing on the things that actually do go 
> wrong.

See Murphy's Law.

> That's your prerogative, of course, but I don't think you'll get 
> much support for it here.

Alas, I suspect that you're correct. Which is why I don't advocate using
Python for "serious" software. Neither the language nor its "culture" are
amenable to robustness.

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-07-25 Thread Steven D'Aprano

On Sun, 25 Jul 2010 13:52:33 +0100, Nobody wrote:

> On Fri, 23 Jul 2010 18:27:50 -0400, Terry Reedy wrote:
> 
>> But in the
>> meanwhile, once you get an error, you know what it is. You can
>> intentionally feed code bad data and see what you get. And then maybe
>> add a test to make sure your code traps such errors.
> 
> That doesn't really help with exceptions which are triggered by external
> factors rather than explicit inputs.

Huh? What do you mean by "external factors"? Do you mean like power 
supply fluctuations, cosmic rays flipping bits in memory, bad hardware? 
You can't defend against that, not without specialist fault-tolerant 
hardware, so just don't worry about it.

If you mean external factors like "the network goes down" or "the disk is 
full", you can still test for those with appropriate test doubles (think 
"stunt doubles", only for testing) such as stubs or mocks. It's a little 
bit more work (sometimes a lot more work), but it can be done.
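
One way such a test double might look, as a sketch in Python 3 using `unittest.mock` from the standard library. The function `save_report` and its "return False when the disk is full" policy are hypothetical, chosen only to have something to test:

```python
import errno
from unittest import mock


def save_report(path, data):
    """Hypothetical function under test: returns False when the disk is full."""
    try:
        with open(path, "w") as f:
            f.write(data)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False
        raise  # anything else is unexpected, so let it propagate
    return True


# Replace open() with a stub that behaves as if the disk were full.
disk_full = OSError(errno.ENOSPC, "No space left on device")
with mock.patch("builtins.open", side_effect=disk_full):
    result = save_report("report.txt", "data")

print(result)
```

No file is created and no disk needs to be filled; the stub raises the same exception type the real call would.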

Or don't worry about it. Release early, release often, and take lots of 
logs. You'll soon learn what exceptions can happen and what can't. Your 
software is still useful even when it's not perfect, and there's always 
time for another bug fix release.

> Also, if you're writing libraries (rather than self-contained programs),
> you have no control over the arguments. 

You can't control what the caller passes to you, but once you have it, 
you have total control over it. You can reject it with an exception, 
stick it inside a wrapper object, convert it to something else, deal with 
it as best you can, or just ignore it.

> Coupled with the fact that duck
> typing is quite widely advocated in Python circles, you're stuck with
> the possibility that any method call on any argument can raise any
> exception. This is even true for calls to standard library functions or
> methods of standard classes if you're passing caller-supplied objects as
> arguments.

That's a gross exaggeration. It's true that some methods could in theory 
raise any exception, but in practice most exceptions are vanishingly 
rare. And it isn't even remotely correct that "any" method could raise 
anything. If you can get something other than NameError, ValueError or 
TypeError by calling "spam".index(arg), I'd like to see it.

Frankly, it sounds to me that you're over-analysing all the things that 
"could" go wrong rather than focusing on the things that actually do go 
wrong. That's your prerogative, of course, but I don't think you'll get 
much support for it here.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-07-25 Thread Nobody

On Fri, 23 Jul 2010 18:27:50 -0400, Terry Reedy wrote:

> But in the 
> meanwhile, once you get an error, you know what it is. You can 
> intentionally feed code bad data and see what you get. And then maybe 
> add a test to make sure your code traps such errors.

That doesn't really help with exceptions which are triggered by external
factors rather than explicit inputs.

Also, if you're writing libraries (rather than self-contained programs),
you have no control over the arguments. Coupled with the fact that
duck typing is quite widely advocated in Python circles, you're stuck with
the possibility that any method call on any argument can raise any
exception. This is even true for calls to standard library functions or
methods of standard classes if you're passing caller-supplied objects as
arguments.

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-07-24 Thread John Machin

dirknbr writes:

> I have kind of developped this but obviously it's not nice, any better
> ideas?
> 
> try:
>     text=texts[i]
>     text=text.encode('latin-1')
>     text=text.encode('utf-8')
> except:
>     text=' '

As Steven has pointed out, if the .encode('latin-1') works, the result is thrown
away. This would be very fortunate. 

It appears that your goal was to encode the text in latin1 if possible,
otherwise in UTF-8, with no indication of which encoding was used. Your second
posting confirmed that you were doing this in a loop, ending up with the
possibility that your output file would have records with mixed encodings.

Did you consider what a programmer writing code to READ your output file would
need to do, e.g. attempt to decode each record as UTF-8 with a fall-back to
latin1??? Did you consider what would be the result of sending a stream of
mixed-encoding text to a display device?

As already advised, the short answer to avoid all of that hassle: just encode in
UTF-8.

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-07-23 Thread Steven D'Aprano

On Fri, 23 Jul 2010 22:46:46 +0100, Nobody wrote:

> On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote:
> 
>> Don't write bare excepts, always catch the error you want and nothing
>> else.
> 
> That advice would make more sense if it was possible to know which
> exceptions could be raised. In practice, that isn't possible, as the
> documentation seldom provides this information. Even for the built-in
> classes, the documentation is weak in this regard; for less important
> modules and third-party libraries, it's entirely absent.

Aside: that's an awfully sweeping generalisation for all third-party 
libraries.

Yes, the documentation is sometimes weak, but that doesn't stop you from 
being sensible. Catching any exception, no matter what, whether you've 
heard of it or seen it before or not, is almost never a good idea. The 
two problems with bare excepts are:

* They mask user generated keyboard interrupts, which is rude.

* They hide unexpected errors and disguise them as expected errors.

You want unexpected errors to raise an exception as early as possible, 
because they probably indicate a bug in your code, and the earlier you 
see the exception, the easier it is to debug.

And even if they don't indicate a bug in your code, but merely an under-
documented function, it's still better to find out what that is rather 
than sweep it under the carpet. You will have learned something new ("oh, 
the httplib functions can raise socket.error as well can they?") which 
makes you a better programmer, you have the opportunity to improve the 
documentation, you might want to handle it differently ("should I try 
again, or just give up now, or reset the flubbler?").

If you decide to just mask the exception, rather than handle it in some 
other way, it is easy enough to add an extra check to the except clause.

-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-07-23 Thread Terry Reedy


On 7/23/2010 5:46 PM, Nobody wrote:

On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote:


Don't write bare excepts, always catch the error you want and nothing
else.


That advice would make more sense if it was possible to know which
exceptions could be raised. In practice, that isn't possible, as the
documentation seldom provides this information. Even for the built-in
classes, the documentation is weak in this regard; for less important
modules and third-party libraries, it's entirely absent.


I intend to bring that issue up on pydev list sometime. But in the 
meanwhile, once you get an error, you know what it is. You can 
intentionally feed code bad data and see what you get. And then maybe 
add a test to make sure your code traps such errors.


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-07-23 Thread Thomas Jollans

On 07/23/2010 11:46 PM, Nobody wrote:
> On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote:
> 
>> Don't write bare excepts, always catch the error you want and nothing 
>> else.
> 
> That advice would make more sense if it was possible to know which
> exceptions could be raised. In practice, that isn't possible, as the
> documentation seldom provides this information. Even for the built-in
> classes, the documentation is weak in this regard; for less important
> modules and third-party libraries, it's entirely absent.
> 

In practice, at least in Python, it tends to be better to work the
"other way around": first, write code without exception handlers. Test.
If you get an exception, there are really two possible reactions:

 1. "WHAT??"
  => This shouldn't be happening. Rather than catching everything,
     fix your code, or think it through until you reach conclusion
     #2 below.

 2. "Ah, yes. Of course. I should check for that."
  => No problem! You're staring at a traceback right now, so you
     know the exception raised.

If you know there should be an exception, but you don't know which one,
it should be trivial to create a condition in which the exception arises,
should it not? Then, you can handle it properly, without resorting to
guesswork or over-generalisations.
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Unicode error

2010-07-23 Thread Benjamin Kaplan

On Fri, Jul 23, 2010 at 2:46 PM, Nobody  wrote:
> On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote:
>
>> Don't write bare excepts, always catch the error you want and nothing
>> else.
>
> That advice would make more sense if it was possible to know which
> exceptions could be raised. In practice, that isn't possible, as the
> documentation seldom provides this information. Even for the built-in
> classes, the documentation is weak in this regard; for less important
> modules and third-party libraries, it's entirely absent.
>

You still don't want to use bare excepts. People tend to get rather
annoyed when you handle KeyboardInterrupts and SystemExits like you
would a UnicodeError. Use Exception if you don't know what exceptions
can be raised.
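
A small sketch of the difference, assuming Python 3 syntax (run_safely
and bad_encode are hypothetical names used only here). Catching
Exception handles ordinary errors such as UnicodeEncodeError but lets
KeyboardInterrupt and SystemExit through, because those derive from
BaseException rather than Exception.

```python
def run_safely(func):
    """Call func(), turning ordinary errors into a short message."""
    try:
        return func()
    except Exception as exc:
        # KeyboardInterrupt and SystemExit are not subclasses of
        # Exception, so they are not swallowed here.
        return "handled %s" % type(exc).__name__


def bad_encode():
    # U+201C cannot be represented in ASCII, so this raises.
    return u"\u201c".encode("ascii")


message = run_safely(bad_encode)
print(message)
```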

Re: Unicode error

2010-07-23 Thread Nobody

On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote:

> Don't write bare excepts, always catch the error you want and nothing 
> else.

That advice would make more sense if it was possible to know which
exceptions could be raised. In practice, that isn't possible, as the
documentation seldom provides this information. Even for the built-in
classes, the documentation is weak in this regard; for less important
modules and third-party libraries, it's entirely absent.


Re: Unicode error

2010-07-23 Thread Thomas Jollans

On 07/23/2010 12:56 PM, dirknbr wrote:
> To give a bit of context. I am using twython which is a wrapper for
> the JSON API
> 
>  
> search=twitter.searchTwitter(s,rpp=100,page=str(it),result_type='recent',lang='en')
> for u in search[u'results']:
>     ids.append(u[u'id'])
>     texts.append(u[u'text'])
> 
> This is where texts comes from.
> 
> When I then want to write texts to a file I get the unicode error.

So your data is unicode? Good.

Well, files are just streams of bytes, so to write unicode data to one
you have to encode it. Since Python can't know which encoding you want
to use (utf-8, by the way, if you ask me), you have to do it manually.

something like:

outfile.write(text.encode('utf-8'))
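
An alternative sketch, assuming io.open is available (Python 2.6+ and
Python 3): open the file with an explicit encoding so that write()
accepts unicode text directly. The sample strings and the temporary
file path are invented for illustration.

```python
import io
import os
import tempfile

texts = [u"plain ascii", u"curly \u201cquotes\u201d"]

# Hypothetical output location, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "texts.txt")

# The file object encodes to UTF-8 on the way out.
with io.open(path, "w", encoding="utf-8") as outfile:
    for text in texts:
        outfile.write(text + u"\n")

# Reading back with the same encoding returns the original text.
with io.open(path, "r", encoding="utf-8") as infile:
    round_trip = infile.read().splitlines()

print(round_trip == texts)
```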


Re: Unicode error

2010-07-23 Thread dirknbr

To give a bit of context. I am using twython which is a wrapper for
the JSON API

 
search=twitter.searchTwitter(s,rpp=100,page=str(it),result_type='recent',lang='en')
for u in search[u'results']:
    ids.append(u[u'id'])
    texts.append(u[u'text'])

This is where texts comes from.

When I then want to write texts to a file I get the unicode error.

Dirk

Re: Unicode error

2010-07-23 Thread Chris Rebert

On Fri, Jul 23, 2010 at 3:14 AM, dirknbr  wrote:
> I am having some problems with unicode from json.
>
> This is the error I get
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x93' in
> position 61: ordinal not in range(128)

Please include the full Traceback and the actual code that's causing
the error! We aren't mind readers.

This error basically indicates that you're incorrectly mixing byte
strings and Unicode strings somewhere.

Cheers,
Chris
--
http://blog.rebertia.com

Re: Unicode error

2010-07-23 Thread Steven D'Aprano

On Fri, 23 Jul 2010 03:14:11 -0700, dirknbr wrote:

> I am having some problems with unicode from json.
> 
> This is the error I get
> 
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x93' in
> position 61: ordinal not in range(128)
> 
> I have kind of developed this but obviously it's not nice, any better
> ideas?
> 
> try:
>     text=texts[i]
>     text=text.encode('latin-1')
>     text=text.encode('utf-8')
> except:
>     text=' '

Don't write bare excepts, always catch the error you want and nothing 
else. As you've written it, the result of encoding with latin-1 is thrown 
away, even if it succeeds.

text = texts[i]  # Don't hide errors here.
try:
    text = text.encode('latin-1')
except UnicodeEncodeError:
    try:
        text = text.encode('utf-8')
    except UnicodeEncodeError:
        text = ' '
do_something_with(text)

Another thing you might consider is setting the error handler:

text = text.encode('utf-8', errors='ignore')

Other error handlers are 'strict' (the default), 'replace' and 
'xmlcharrefreplace'.
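
A quick comparison of those handlers, as a sketch in Python 3 syntax
(the sample text is invented; the target codec is ASCII so that the
curly quotes actually trigger the handler):

```python
text = "say \u201chi\u201d"

# Each handler deals with the two unencodable quote characters differently.
ignored = text.encode("ascii", errors="ignore")
replaced = text.encode("ascii", errors="replace")
xml_refs = text.encode("ascii", errors="xmlcharrefreplace")

try:
    text.encode("ascii", errors="strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

print(ignored)
print(replaced)
print(xml_refs)
print(strict_raised)
```

Note that 'ignore' silently drops data, so it is best kept for cases
where losing characters is acceptable.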

-- 
Steven

Unicode error

2010-07-23 Thread dirknbr

I am having some problems with unicode from json.

This is the error I get

UnicodeEncodeError: 'ascii' codec can't encode character u'\x93' in
position 61: ordinal not in range(128)

I have kind of developed this but obviously it's not nice, any better
ideas?

try:
    text=texts[i]
    text=text.encode('latin-1')
    text=text.encode('utf-8')
except:
    text=' '

Dirk

Re: Python 2.4 vs 2.5 - Unicode error

2009-01-22 Thread Gaurav Veda

On Jan 21, 7:08 pm, John Machin  wrote:
>
> To replace non-ASCII characters in a UTF-8-encoded string by spaces:
> | >>> u8 = ' and 25\xc2\xb0F'
> | >>> u = u8.decode('utf8')
> | >>> ''.join([chr(ord(c)) if c <= u'\x7f' else ' ' for c in u])
> | ' and 25 F'

Thanks John for your reply. This is what I needed.

Cheers,
Gaurav

Re: Python 2.4 vs 2.5 - Unicode error

2009-01-21 Thread Wolfgang Rohdewald

On Mittwoch, 21. Januar 2009, Gaurav Veda wrote:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 4357: ordinal not in range(128)
> 
> Before sending the (insert) query to the mysql server, I do the
> following which I think should've taken care of this problem:
>  sqlStr = sqlStr.replace('\\', '')

you might consider using what mysql offers about unicode: save
all strings encoded as unicode. Might be more work now but I think
it would be a good investment in the future.

have a look at the mysql documentation for

mysql_real_escape_string() takes care of quoted chars. 

mysql_set_character_set() for setting the character set used
by the database connection

you can ensure that the web page is unicode by doing something
like

charsetregex = re.compile(r'charset=(.*?)[\"&]')
charsetmatch = charsetregex.search(page)
if charsetmatch:
   charset=charsetmatch.group(1)
   utf8Text = unicode(page,charset)
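
The same idea as a Python 3 sketch, where the raw page is bytes and
bytes.decode() plays the role of unicode(). The function name
decode_page, the sample page and the UTF-8 fallback are assumptions
made for illustration.

```python
import re

# Same pattern as above, written as a bytes pattern for a bytes page.
charset_regex = re.compile(rb'charset=(.*?)["&]')


def decode_page(page, default="utf-8"):
    """Decode a raw page using the charset it declares, if any."""
    match = charset_regex.search(page)
    # The fallback to `default` is an assumption, not part of the original.
    charset = match.group(1).decode("ascii") if match else default
    return page.decode(charset)


page = (b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1">caf\xe9')
text = decode_page(page)
print(text)
```

A page that declares the wrong charset, or an unknown one, will still
raise, so real code would want to handle LookupError and
UnicodeDecodeError around the decode call.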

-- 
Wolfgang

Re: Python 2.4 vs 2.5 - Unicode error

2009-01-21 Thread John Machin

On Jan 22, 9:50 am, Gaurav Veda  wrote:
> > The 0xc2 strongly suggests that you are feeding the beast data encoded
> > in UTF-8 while giving it no reason to believe that it is in fact not
> > encoded in ASCII. Curiously the first errant byte is a long way (4KB)
> > into your data. Consider doing
> >     print repr(data)
> > to see what you've actually got there.
> >>> sqlStr[4352:4362]
>
> ' and 25\xc2\xb0F'

That's the UTF-8 version of ' and 25°F' where the character between
the 25 and the F is U+00B0 DEGREE SIGN ... interesting stuff to have
in an SQL query string.

>
> All I want to do is to just replace all the non-ascii characters by a
> space.

I can't imagine why you would want to do that to data, let alone to an
SQL query.

I can't see any evidence that you actually tried to do that, anyway.

To replace non-ASCII characters in a UTF-8-encoded string by spaces:
| >>> u8 = ' and 25\xc2\xb0F'
| >>> u = u8.decode('utf8')
| >>> ''.join([chr(ord(c)) if c <= u'\x7f' else ' ' for c in u])
| ' and 25 F'
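
For comparison, a sketch of the same replacement in Python 3, where the
input is a bytes object and decoding yields str, so the chr(ord(c))
round trip is not needed:

```python
u8 = b' and 25\xc2\xb0F'
u = u8.decode('utf8')

# Keep ASCII characters, replace everything else with a space.
cleaned = ''.join(c if c <= '\x7f' else ' ' for c in u)
print(repr(cleaned))
```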

>
> > I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
> > different versions of mysql, perhaps?
>
> I am trying to put content into the mysql server running on machine A,
> from machine B & machine C with different versions of python. So I
> don't think this is a mysql issue.

Terminology confusion. Consider the possibility of different versions
of MySQLdb (the client interface package) on the client machines B and
C.

Also consider the possibility that you didn't run exactly the same
code on B and C.

> > Show at the very least the full traceback that you get. Try to write a
> > short script that demonstrates the problem with 2.5 and no problem
> > with 2.4, so that (a) it is apparent what you are doing (b) the
> > problem can be reproduced if necessary by someone with access to
> > mysql.

How about a very small script which includes the minimum necessary to
run these two lines (with appropriate substitutions for column_x and
table_y):
sql_str = "select column_x from table_y where column_x = '\xc2\xb0'"
cursor.execute(sql_str)

and run that on B and C

>
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "putDataIntoDB.py", line 164, in 
>     cursor.execute(sqlStr)
>   File "/usr/lib64/python2.5/site-packages/MySQLdb/cursors.py", line
> 146, in execute
>     query = query.encode(charset)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 4359: ordinal not in range(128)
>
> > You might like to explain why you think that doubling backslashes in
> > your SQL is a good idea, and amplify "some processing on the text".
>
> I thought this will achieve 2 things.
> a) It will escape any unicode character (obviously, I was wrong. Got
> carried away by the display. I thought \xc2 will get escaped to \\xc2,
> which is completely preposterous).
> b) It will make sure that the escape sequences in the string (e.g.
> '\n') are received by mysql as an escape sequence.

Run-time programmatic fiddling with an SQL query string is dangerous
and tricky at the best of times, worse when you don't inspect the
result before you press the launch button.
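
One way to avoid editing the query string at all is to pass values as
parameters and let the driver do the quoting. The sketch below uses the
standard-library sqlite3 module purely as a stand-in so that it runs
anywhere; it is not MySQLdb, which follows the same DB-API pattern but
uses %s placeholders instead of ?. The table and column names are the
placeholders from above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE table_y (column_x TEXT)")

# Decode the UTF-8 bytes first; the value is U+00B0 DEGREE SIGN.
value = b"\xc2\xb0".decode("utf-8")

# The value travels separately from the SQL text, so no manual
# escaping or backslash doubling is involved.
cursor.execute("INSERT INTO table_y (column_x) VALUES (?)", (value,))
cursor.execute("SELECT column_x FROM table_y WHERE column_x = ?", (value,))
rows = cursor.fetchall()
conn.close()

print(rows)
```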

Cheers,
John

Re: Python 2.4 vs 2.5 - Unicode error

2009-01-21 Thread Gaurav Veda

> The 0xc2 strongly suggests that you are feeding the beast data encoded
> in UTF-8 while giving it no reason to believe that it is in fact not
> encoded in ASCII. Curiously the first errant byte is a long way (4KB)
> into your data. Consider doing
>     print repr(data)
> to see what you've actually got there.

>>> sqlStr[4352:4362]
' and 25\xc2\xb0F'

All I want to do is to just replace all the non-ascii characters by a
space.

> I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
> different versions of mysql, perhaps?

I am trying to put content into the mysql server running on machine A,
from machine B & machine C with different versions of python. So I
don't think this is a mysql issue.

> Show at the very least the full traceback that you get. Try to write a
> short script that demonstrates the problem with 2.5 and no problem
> with 2.4, so that (a) it is apparent what you are doing (b) the
> problem can be reproduced if necessary by someone with access to
> mysql.

Traceback (most recent call last):
  File "", line 1, in 
  File "putDataIntoDB.py", line 164, in 
    cursor.execute(sqlStr)
  File "/usr/lib64/python2.5/site-packages/MySQLdb/cursors.py", line
146, in execute
    query = query.encode(charset)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
4359: ordinal not in range(128)

> You might like to explain why you think that doubling backslashes in
> your SQL is a good idea, and amplify "some processing on the text".

I thought this will achieve 2 things.
a) It will escape any unicode character (obviously, I was wrong. Got
carried away by the display. I thought \xc2 will get escaped to \\xc2,
which is completely preposterous).
b) It will make sure that the escape sequences in the string (e.g.
'\n') are received by mysql as an escape sequence.

Thanks for your reply!
Gaurav

> HTH,
> John


Re: Python 2.4 vs 2.5 - Unicode error

2009-01-21 Thread John Machin

On Jan 22, 4:49 am, Gaurav Veda  wrote:
> Hi,
>
> I am trying to put some webpages into a mysql database using python
> (after some processing on the text). If I use Python 2.4.2, it works
> without a fuss. However, on Python 2.5, I get the following error:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 4357: ordinal not in range(128)
>
> Before sending the (insert) query to the mysql server, I do the
> following which I think should've taken care of this problem:
>  sqlStr = sqlStr.replace('\\', '')
>
> (where sqlStr is the query).
>
> Any suggestions?

The 0xc2 strongly suggests that you are feeding the beast data encoded
in UTF-8 while giving it no reason to believe that it is in fact not
encoded in ASCII. Curiously the first errant byte is a long way (4KB)
into your data. Consider doing
    print repr(data)
to see what you've actually got there.
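
A sketch of what that inspection can look like, in Python 3 syntax and
with sample bytes invented for illustration. The exception object
records where decoding failed, which points straight at the offending
byte.

```python
data = b'25\xc2\xb0F'  # made-up sample: UTF-8 bytes for a degree sign
print(repr(data))

try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte the codec rejected.
    position = exc.start
    bad_byte = data[exc.start:exc.start + 1]

# Decoding with the encoding the data is really in succeeds.
decoded = data.decode('utf-8')
print(position, bad_byte, decoded)
```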

I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
different versions of mysql, perhaps?

Show at the very least the full traceback that you get. Try to write a
short script that demonstrates the problem with 2.5 and no problem
with 2.4, so that (a) it is apparent what you are doing (b) the
problem can be reproduced if necessary by someone with access to
mysql.

You might like to explain why you think that doubling backslashes in
your SQL is a good idea, and amplify "some processing on the text".

HTH,
John

Python 2.4 vs 2.5 - Unicode error

2009-01-21 Thread Gaurav Veda

Hi,

I am trying to put some webpages into a mysql database using python
(after some processing on the text). If I use Python 2.4.2, it works
without a fuss. However, on Python 2.5, I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
4357: ordinal not in range(128)

Before sending the (insert) query to the mysql server, I do the
following which I think should've taken care of this problem:
 sqlStr = sqlStr.replace('\\', '')

(where sqlStr is the query).

Any suggestions?

Thanks!
Gaurav

Re: odd unicode error

2007-04-12 Thread tubby

Martin v. Löwis wrote:
>> path += '/' + b
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1:
>> ordinal not in range(128)
>>
>> Any ideas?
> 
> path is a Unicode string, b is a byte string and contains the
> byte \xd0.
> 
> The problem is that you have a directory with file names in it that
> cannot be converted to Unicode strings, using the file system
> encoding. If you can't fix the file system, you have to make
> search_path a byte string.
> 
> Regards,
> Martin

I fixed it... I didn't tell the whole story. The interface uses 
wxpython. It returns a unicode pathname that os.walk() uses. I changed 
that pathname with str() and now, it no longer barfs.

Re: odd unicode error

2007-04-12 Thread Martin v. Löwis

> path += '/' + b
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1:
> ordinal not in range(128)
> 
> Any ideas?

path is a Unicode string, b is a byte string and contains the
byte \xd0.

The problem is that you have a directory with file names in it that
cannot be converted to Unicode strings, using the file system
encoding. If you can't fix the file system, you have to make
search_path a byte string.
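
For readers on Python 3, a sketch of how the type of the starting path
carries through os.walk(): a str path yields str names and a bytes path
yields bytes names, so the mixed-type join above does not occur. The
temporary directory and file name are invented for the demonstration.

```python
import os
import tempfile

root = tempfile.mkdtemp()
open(os.path.join(root, "example.txt"), "w").close()

# Text path in, text names out.
text_names = [f for _, _, files in os.walk(root) for f in files]

# Byte path in, byte names out; os.fsencode uses the filesystem encoding.
byte_names = [f for _, _, files in os.walk(os.fsencode(root)) for f in files]

print(text_names, byte_names)
```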

Regards,
Martin

odd unicode error

2007-04-12 Thread tubby

This:

for root, dirs, files in os.walk(search_path):
 for f in files:
 print f

###

Produces this:

Traceback (most recent call last):
   File "/home/brad/Desktop/my_script.pyw", line 340, in -toplevel-
 hunt(target_files(search_path, skip_file_extensions(), 
skip_files()), path_to_results)
   File "/home/brad/Desktop/my_script.pyw", line 161, in target_files
 for root, dirs, files in os.walk(search_path):
   File "os.py", line 291, in walk
 for x in walk(path, topdown, onerror):
   File "os.py", line 291, in walk
 for x in walk(path, topdown, onerror):
   File "os.py", line 281, in walk
 if isdir(join(top, name)):
   File "posixpath.py", line 65, in join
 path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: 
ordinal not in range(128)

##

I'm running Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2

Any ideas? I can't catch this with try/except and using unicode(f) 
doesn't help either.

Re: Unicode error handler

2007-01-31 Thread Walter Dörwald

[EMAIL PROTECTED] wrote:
> On Jan 30, 11:28 pm, Walter Dörwald <[EMAIL PROTECTED]> wrote:
> 
>> codecs.register_error("transliterate", transliterate)
>>
>>Walter
> 
> Really, really slick solution.
> Though, why was it [:1], not [0]? ;-)

No particular reason, unicodedata.normalize("NFD", ...) should never
return an empty string.

> And one more thing:
>> def transliterate(exc):
>>     if not isinstance(exc, UnicodeEncodeError):
>>         raise TypeError("don'ty know how to handle %r" % r)
> I don't understand what %r and r are and where they are from. The man
> 3 printf page doesn't have %r formatting.

%r means format the repr() result, and r was supposed to be exc. ;)
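
With that correction applied, the handler could look like the sketch
below. It is written for Python 3 (str instead of u"" literals, bytes
result), which is an adaptation rather than the code as posted.

```python
import codecs
import unicodedata


def transliterate(exc):
    if not isinstance(exc, UnicodeEncodeError):
        # %r formats repr(exc); the argument is exc, not r.
        raise TypeError("don't know how to handle %r" % exc)
    # Keep only the first code point of the NFD decomposition,
    # which is the base letter when one exists.
    base = unicodedata.normalize("NFD", exc.object[exc.start])[:1]
    # Resume encoding right after the character just handled.
    return (base, exc.start + 1)


codecs.register_error("transliterate", transliterate)

result = "Fr\u00e9d\u00e9ric Chopin".encode("ascii", "transliterate")
print(result)
```

If the base character itself is not encodable in the target encoding,
the encode call still raises UnicodeEncodeError.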

Servus,
   Walter

Re: Unicode error handler

2007-01-31 Thread Walter Dörwald

Martin v. Löwis wrote:

> Walter Dörwald schrieb:
>> You might try the following:
>>
>> # -*- coding: iso-8859-1 -*-
>>
>> import unicodedata, codecs
>>
>> def transliterate(exc):
>>     if not isinstance(exc, UnicodeEncodeError):
>>         raise TypeError("don'ty know how to handle %r" % r)
>>     return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
>>             exc.start+1)
> 
> I think a number of special cases need to be studied here.
> I would expect that this is "semantically correct" if the characters
> being dropped are combining characters (at least in the languages I'm
> familiar with, it is common to drop them for transliteration).

True, it might make sense to limit the error handler to handling latin 
characters.

> However, if you do
> 
> py> for i in range(65536):
> ...   c = unicodedata.normalize("NFD", unichr(i))
> ...   for c2 in c[1:]:
> ...     if not unicodedata.combining(c2): print hex(i),;break
> 
> you'll see that there are many characters which don't decompose
> into a base character + sequence of combining characters. In
> particular, this involves all hangul syllables (U+AC00..U+D7A3),
> for which it is just incorrect to drop the "jungseongs"
> (is that proper wording?).

Of course the above error handler only makes sense when the decomposed 
codepoints are encodable in the target encoding. For your hangul example 
neither u"\uac00" nor the decomposed version u"\u1100\u1161" is encodable.

> There are also some cases which I'm completely uncertain about,
> e.g. ORIYA VOWEL SIGN AI decomposes to ORIYA VOWEL SIGN E +
> ORIYA AI LENGTH MARK. Is it correct to drop the length mark?
> It's not listed as a combining character. Likewise,
> MYANMAR LETTER UU decomposes to MYANMAR LETTER U +
> MYANMAR VOWEL SIGN II; same question here.

Servus,
Walter


Re: Unicode error handler

2007-01-31 Thread Gabriel Genellina

En Wed, 31 Jan 2007 01:21:49 -0300, [EMAIL PROTECTED]  
<[EMAIL PROTECTED]> escribió:

> I don't understand what %r and r are and where they are from. The man
> 3 printf page doesn't have %r formatting.

Perhaps you should look into the Python docs instead?

-- 
Gabriel Genellina


Re: Unicode error handler

2007-01-30 Thread Martin v. Löwis

Walter Dörwald schrieb:
> You might try the following:
> 
> # -*- coding: iso-8859-1 -*-
> 
> import unicodedata, codecs
> 
> def transliterate(exc):
>     if not isinstance(exc, UnicodeEncodeError):
>         raise TypeError("don'ty know how to handle %r" % r)
>     return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
>             exc.start+1)

I think a number of special cases need to be studied here.
I would expect that this is "semantically correct" if the characters
being dropped are combining characters (at least in the languages I'm
familiar with, it is common to drop them for transliteration).

However, if you do

py> for i in range(65536):
...   c = unicodedata.normalize("NFD", unichr(i))
...   for c2 in c[1:]:
...     if not unicodedata.combining(c2): print hex(i),;break

you'll see that there are many characters which don't decompose
into a base character + sequence of combining characters. In
particular, this involves all hangul syllables (U+AC00..U+D7A3),
for which it is just incorrect to drop the "jungseongs"
(is that proper wording?).

There are also some cases which I'm completely uncertain about,
e.g. ORIYA VOWEL SIGN AI decomposes to ORIYA VOWEL SIGN E +
ORIYA AI LENGTH MARK. Is it correct to drop the length mark?
It's not listed as a combining character. Likewise,
MYANMAR LETTER UU decomposes to MYANMAR LETTER U +
MYANMAR VOWEL SIGN II; same question here.
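
The survey loop above is Python 2 (unichr, print statement). A Python 3
sketch of the same check follows; it collects the code points instead
of printing them, and skipping the surrogate range is a precaution
added here, not part of the original loop.

```python
import unicodedata

flagged = []
for i in range(0x10000):
    if 0xD800 <= i <= 0xDFFF:
        continue  # lone surrogates are not characters
    decomposed = unicodedata.normalize("NFD", chr(i))
    # Flag code points whose decomposition has a trailing character
    # that is not a combining character (combining class 0).
    if any(not unicodedata.combining(c) for c in decomposed[1:]):
        flagged.append(i)

print(len(flagged), [hex(i) for i in flagged[:5]])
```

The exact list depends on the Unicode version shipped with the
interpreter.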

Regards,
Martin

Re: Unicode error handler

2007-01-30 Thread [EMAIL PROTECTED]

On Jan 30, 11:28 pm, Walter Dörwald <[EMAIL PROTECTED]> wrote:

>
> codecs.register_error("transliterate", transliterate)
>
>Walter

Really, really slick solution.
Though, why was it [:1], not [0]? ;-)

And one more thing:
> def transliterate(exc):
>     if not isinstance(exc, UnicodeEncodeError):
>         raise TypeError("don'ty know how to handle %r" % r)
I don't understand what %r and r are and where they are from. The man
3 printf page doesn't have %r formatting.

Thanks for the tip.
Hieu


Re: Unicode error handler

2007-01-30 Thread Walter Dörwald

Rares Vernica wrote:
> Hi,
> 
> Does anyone know of any Unicode encode/decode error handler that does a 
> better replace job than the default replace error handler?
> 
> For example I have an iso-8859-1 string that has an 'e' with an accent 
> (you know, the French 'e's). When I use s.encode('ascii', 'replace') the 
> 'e' will be replaced with '?'. I would prefer to be replaced with an 'e' 
> even if I know it is not 100% correct.
> 
> If only this letter would be the problem I would do it manually, but 
> there is an entire set of letters that need to be replaced with their 
> closest ascii letter.
> 
> Is there an encode/decode error handler that can replace all the 
> not-ascii letters from iso-8859-1 with their closest ascii letter?

You might try the following:

# -*- coding: iso-8859-1 -*-

import unicodedata, codecs

def transliterate(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise TypeError("don'ty know how to handle %r" % r)
    return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
            exc.start+1)

codecs.register_error("transliterate", transliterate)

print u"Frédéric Chopin".encode("ascii", "transliterate")

Running this script gives you:
$ python transliterate.py
Frederic Chopin

Hope that helps.

Servus,
   Walter

Re: Unicode error handler

2007-01-26 Thread Rares Vernica

It does the job.

Thanks a lot,
Ray

Peter Otten wrote:
> Rares Vernica wrote:
> 
>> Is there an encode/decode error handler that can replace all the
>> not-ascii letters from iso-8859-1 with their closest ascii letter?
> 
> A mapping, not an error handler, but it might do the job:
> 
> http://effbot.org/zone/unicode-convert.htm
> 
> Peter


Re: Unicode error handler

2007-01-26 Thread Robert Kern

Rares Vernica wrote:
> Is there an encode/decode error handler that can replace all the 
> not-ascii letters from iso-8859-1 with their closest ascii letter?

No, but IBM's ICU library can transform one script to another in very flexible
and capable ways. One such configuration can do what you ask.

  http://www-306.ibm.com/software/globalization/icu/index.jsp
  http://icu.sourceforge.net/userguide/Transform.html

Unfortunately, I don't think any of the available ICU bindings for Python have
exposed this functionality. If you wanted to contribute such, you might want to
start with PyICU. It seems to be the most actively developed of the bindings.

  http://pyicu.osafoundation.org/

Of course, that's overkill for this problem. Those transformations can handle
such things as this:

  Αλφαβητικός Κατάλογος Alphabētikós Katálogos

The number of characters in iso-8859-1 that you would want to transliterate is
not all that large. You could spend a little bit of time going through the
character set and making a translation map for str.translate().
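
A sketch of such a translation map, written for Python 3 where
str.maketrans() accepts a dict of single characters. The table below is
a deliberately small, hand-picked sample, not a complete iso-8859-1
mapping, and the names TABLE and to_closest_ascii are made up.

```python
# Partial map: a few accented Latin-1 letters to their closest ASCII letter.
TABLE = str.maketrans({
    "\u00e0": "a",  # a grave
    "\u00e2": "a",  # a circumflex
    "\u00e7": "c",  # c cedilla
    "\u00e8": "e",  # e grave
    "\u00e9": "e",  # e acute
    "\u00ea": "e",  # e circumflex
    "\u00ee": "i",  # i circumflex
    "\u00f1": "n",  # n tilde
    "\u00f4": "o",  # o circumflex
    "\u00fc": "u",  # u diaeresis
})


def to_closest_ascii(text):
    """Replace mapped characters; anything not in TABLE is left alone."""
    return text.translate(TABLE)


print(to_closest_ascii("Fr\u00e9d\u00e9ric Chopin"))
```

Characters missing from the table pass through unchanged, so a final
encode('ascii', 'replace') may still be wanted afterwards.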

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco


Re: Unicode error handler

2007-01-26 Thread Peter Otten

Rares Vernica wrote:

> Is there an encode/decode error handler that can replace all the
> not-ascii letters from iso-8859-1 with their closest ascii letter?

A mapping, not an error handler, but it might do the job:

http://effbot.org/zone/unicode-convert.htm

Peter

Unicode error handler

2007-01-26 Thread Rares Vernica

Hi,

Does anyone know of any Unicode encode/decode error handler that does a 
better replace job than the default replace error handler?

For example I have an iso-8859-1 string that has an 'e' with an accent 
(you know, the French 'e's). When I use s.encode('ascii', 'replace') the 
'e' will be replaced with '?'. I would prefer to be replaced with an 'e' 
even if I know it is not 100% correct.

If only this letter would be the problem I would do it manually, but 
there is an entire set of letters that need to be replaced with their 
closest ascii letter.

Is there an encode/decode error handler that can replace all the 
not-ascii letters from iso-8859-1 with their closest ascii letter?

Thanks a lot,
Ray


RE: Unicode Error

2006-08-23 Thread Tim Golden

[Gallagher, Tim (NE)]

| Hey all I am learning Python and having a fun time doing so.  
| I have a question for y'all, it has to do with active directory.

| I want to get the last login for a computer from Active 
| Directory.  I am using the active_directory module and here 
| is my code.

[START]
import active_directory
computer = active_directory.root()
for cpu in computer.search ("cn='Computer_Name'"): 
    print cpu.samAccountName    #←--- Works fine
    print cpu.operatingSystem   #←--- Works fine
    print cpu.lastLogon         #←--- Getting Error
[END]

| I get an error that I am not sure what to do with, the error 
| is TypeError: coercing to Unicode: need string or buffer, 
| instance found in my line Do I have to change the output to 
| meet Unicode formation?

I started to write an explanation of Unicode and what an
encoding was and why you needed it, but then I realised
that it wouldn't help - at least not here - because the
problem seems to involve converting the value in cpu.lastLogon
to Unicode. And I'm not sure why it's even trying to do that.

The lastLogon value (according to the MS docs) is actually
a structure in its own right with a HighPart and a LowPart,
and you perform various maths on these numbers to give
you a real date. In my case (cf code below) if I simply print the 
lastLogon, I get the anonymous  string.


import active_directory
me = active_directory.find_computer ()
print me.samAccountName

print me.lastLogon 
# gives >

print me.lastLogon.HighPart, me.lastLogon.LowPart
# gives two long numbers



Short answer, try lastLogon.HighPart & lastLogon.LowPart
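
As for the maths: a sketch, assuming lastLogon follows the usual
Active Directory convention of counting 100-nanosecond intervals since
1601-01-01 UTC, split into two 32-bit halves. The function name is
hypothetical, and the masking of the low half is there because it may
arrive as a signed number.

```python
from datetime import datetime, timedelta


def filetime_to_datetime(high_part, low_part):
    """Combine the two 32-bit halves and convert to a naive UTC datetime."""
    # Mask the low half in case it was delivered as a negative integer.
    ticks = (high_part << 32) | (low_part & 0xFFFFFFFF)
    # One tick is 100 ns, so ten ticks make one microsecond.
    return datetime(1601, 1, 1) + timedelta(microseconds=ticks // 10)


# Tick count of the Unix epoch, used here only as a known reference value.
unix_epoch_ticks = 116444736000000000
print(filetime_to_datetime(unix_epoch_ticks >> 32,
                           unix_epoch_ticks & 0xFFFFFFFF))
```

With the objects above that would be called as
filetime_to_datetime(me.lastLogon.HighPart, me.lastLogon.LowPart).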

TJG




Unicode Error

2006-08-22 Thread Gallagher, Tim (NE)

Hey all I am learning Python and having a fun time doing so.  I have a
question for y'all, it has to do with active directory.

I want to get the last login for a computer from Active Directory.  I am
using the active_directory module and here is my code.

[START]
import active_directory
computer = active_directory.root()
for cpu in computer.search ("cn='Computer_Name'"): 
    print cpu.samAccountName    #←--- Works fine
    print cpu.operatingSystem   #←--- Works fine
    print cpu.lastLogon         #←--- Getting Error
[END]

I get an error that I am not sure what to do with, the error is
TypeError: coercing to Unicode: need string or buffer, instance found
in my line. Do I have to change the output to meet Unicode formation?

Thanks,
-T


Re: unicode error

2006-03-17 Thread Scott David Daniels

[EMAIL PROTECTED] wrote:
> I have this python code:
> print >> htmlFile, " style=\"width: 200px; height:18px;\">";
> 
> 
> But that causes this error, and I can't figure out why. Any help is
> appreciated
>  File "./run.py", line 193, in ?
> print >> htmlFile, " style=\"width: 200px; height:18px;\">";
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9:
> ordinal not in range(128)
> 
> Thanks.
> 
You can make the code easier to read by using single quotes to quote
strings with double quotes inside:

  print >> htmlFile, ('')

Or even better:

  print >> htmlFile, (u'') % unicode(1)

The unicode(1) confuses me -- you are converting an integer to its
string representation in unicode (do you know that?), not picking a
particular character.

  print >> htmlFile, (u'') % (1,)

And if you don't mean to be writing unicode, you could use:

  print >> htmlFile, ('') % (1,)

--Scott David Daniels
[EMAIL PROTECTED]

Re: unicode error

2006-03-17 Thread jean-michel bain-cornu

[EMAIL PROTECTED] wrote:
> I have this python code:
> print >> htmlFile, " style=\"width: 200px; height:18px;\">";
> 
> 
> But that causes this error, and I can't figure out why. Any help is
> appreciated
>  File "./run.py", line 193, in ?
> print >> htmlFile, " style=\"width: 200px; height:18px;\">";
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9:
> ordinal not in range(128)
> 
> Thanks.
> 
Hi,
I tried and it worked (wrote into the file:).
Can you try to isolate exactly what part of the code is wrong ?
jm
Here is the complete code:
htmlfile=file('jmbc.txt','w')
print >> htmlfile, "";
htmlfile.close()

unicode error

2006-03-17 Thread Allerdyce . John

I have this python code:
print >> htmlFile, "";


But that causes this error, and I can't figure out why. Any help is
appreciated
 File "./run.py", line 193, in ?
print >> htmlFile, "";
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9:
ordinal not in range(128)

Thanks.


Re: Unicode error in wx_gdi ?

2005-03-04 Thread Serge Orlov

Erik  Bethke wrote:
> Hello All,
>
> I still shaking out my last few bugs in my tile matching game:
>
> I am now down to one stumper for me:
>  1) when I initialize wxPython
>  2) from an exe that I have created with py2exe
>  3) when the executable is located on the desktop as opposed to
> somewhere on C or D directly
>  4) when My Desktop is not written in ascii but instead Korean hangul
>
> I get this error:
>
> Traceback (most recent call last):
>   File "shanghai.py", line 13, in ?
>   File "wxPython\__init__.pyc", line 10, in ?
>   File "wxPython\_wx.pyc", line 3, in ?
>   File "wxPython\_core.pyc", line 15, in ?
>   File "wx\__init__.pyc", line 42, in ?
>   File "wx\_core.pyc", line 10994, in ?
>   File "wx\_gdi.pyc", line 2443, in ?
>   File "wx\_gdi.pyc", line 2340, in Locale_AddCatalogLookupPathPrefix
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xbf in position
> 26: ordinal not in range(128)
>
> Granted this may seem like an obscure error,

Thanks to your explanation, it doesn't look very obscure. I think
the code in wxpython either uses sys.path[0] or __file__. Python
still keeps byte strings in there because of backward compatibility.

> What do I do from here?  Do I go into wx_gdi.py and fix it so that it
> uses unicode instead of ascii?  I have not yet made any changes to
> other people's libraries...

You should contact the wxPython people for a proper cross-platform fix.
Meanwhile, you can fix that particular error on Windows
by changing sys.path[0] into
sys.path[0].decode(sys.getfilesystemencoding())
or doing the same thing for __file__. If there are a lot of similar
problems, you can call sys.setdefaultencoding('mbcs') at the start of
your program as a last resort. Don't tell anyone I suggested that :)
Remember that site.py removes sys.setdefaultencoding at startup, and that
changing the default encoding can mask encoding bugs and make those
bugs hard to trace.
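
For readers who want to see the decode step in isolation, here is a
minimal sketch of it. It is written for Python 3 so that it runs today
(the thread is about Python 2, where sys.path[0] is still a byte
string). The helper name decode_path, the example desktop path and the
choice of cp949 as the Korean code page are assumptions made for
illustration only; none of them come from wxPython or from the original
post.

```python
import sys


def decode_path(raw, encoding=None):
    """Return a text path for a path that may be a byte string.

    Hypothetical helper: text input is returned unchanged; bytes are
    decoded with the given encoding, or with the filesystem encoding
    when none is given.
    """
    if isinstance(raw, str):
        return raw
    return raw.decode(encoding or sys.getfilesystemencoding())


# Hypothetical desktop path containing Korean hangul.
desktop = "C:\\Documents and Settings\\user\\바탕 화면"

# Simulate the byte string an old Python would have stored,
# assuming the system code page is cp949.
raw = desktop.encode("cp949")

# Decoding with the ASCII codec fails, as in the traceback above.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the matching encoding restores the original path.
restored = decode_path(raw, "cp949")
```

On a real system the encoding argument would normally be left out, so
that sys.getfilesystemencoding() picks the right code page.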

  Serge.

-- 
http://mail.python.org/mailman/listinfo/python-list

Unicode error in wx_gdi ?

2005-03-04 Thread Erik Bethke

Hello All,

I am still shaking out my last few bugs in my tile matching game:

I am now down to one stumper for me:
 1) when I initialize wxPython
 2) from an exe that I have created with py2exe
 3) when the executable is located on the desktop as opposed to
somewhere on C or D directly
 4) when My Desktop is not written in ascii but instead Korean hangul

I get this error:

Traceback (most recent call last):
  File "shanghai.py", line 13, in ?
  File "wxPython\__init__.pyc", line 10, in ?
  File "wxPython\_wx.pyc", line 3, in ?
  File "wxPython\_core.pyc", line 15, in ?
  File "wx\__init__.pyc", line 42, in ?
  File "wx\_core.pyc", line 10994, in ?
  File "wx\_gdi.pyc", line 2443, in ?
  File "wx\_gdi.pyc", line 2340, in Locale_AddCatalogLookupPathPrefix
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbf in position
26: ordinal not in range(128)

Granted this may seem like an obscure error, but the net effect is that
I cannot use wxPython for my games and applications, as many of my users
will place the executable directly on their desktop, and the path of the
desktop contains non-ASCII characters.

What do I do from here?  Do I go into wx_gdi.py and fix it so that it
uses unicode instead of ascii?  I have not yet made any changes to
other people's libraries...

Any help would be much appreciated,
-Erik

-- 
http://mail.python.org/mailman/listinfo/python-list
