Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I would say so as well. Thanks to everyone who helped. Regards and best wishes. -- https://mail.python.org/mailman/listinfo/python-list
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
On 05/10/2015 05:10 PM, zljubisic...@gmail.com wrote: No, we can't see what ROOTDIR is, since you read it from the config file. And you don't show us the results of those prints. You don't even show us the full exception, or even the line it fails on. Sorry I forgot. This is the output of the script: C:\Python34\python.exe C:/Users/zoran/PycharmProjects/mm_align/bckslash_test.py C:\Users\zoran\hrt Traceback (most recent call last): File "C:/Users/zoran/PycharmProjects/mm_align/bckslash_test.py", line 43, in with open(src_file, mode='w', encoding='utf-8') as s_file: FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\zoran\\hrt\\src_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt' 70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o 260 C:\Users\zoran\hrt\src_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt 260 C:\Users\zoran\hrt\des_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt Process finished with exit code 1 Cfg file has the following contents: C:\Users\zoran\PycharmProjects\mm_align\hrt3.cfg contents [Dir] ROOTDIR = C:\Users\zoran\hrt I doubt that the problem is in the ROODIR value, but of course nothing in your program bothers to check that that directory exists. 
I expect you either have too many characters total, or the 232nd character is a strange one. Or perhaps title has a backslash in it (you took care of forward slash). How to determine that? Probably by calling os.path.isdir(). While we're at it, if you do have an OS limitation on size, your code is truncating at the wrong point. You need to truncate the title based on the total size of src_file and dst_file, and since the code cannot know the size of ROOTDIR, you need to include that in your figuring. Well, in my program I am defining a file name as category-id-description.mp3. If the file name is too long, I cut the description (it wasn't clear from my example). Since you've got non-ASCII characters in that name, the utf-8 version of the name will be longer. I don't run Windows, but perhaps it's just a length problem after all. -- DaveA -- https://mail.python.org/mailman/listinfo/python-list
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
> No, we can't see what ROOTDIR is, since you read it from the config > file. And you don't show us the results of those prints. You don't > even show us the full exception, or even the line it fails on. Sorry I forgot. This is the output of the script: C:\Python34\python.exe C:/Users/zoran/PycharmProjects/mm_align/bckslash_test.py C:\Users\zoran\hrt Traceback (most recent call last): File "C:/Users/zoran/PycharmProjects/mm_align/bckslash_test.py", line 43, in with open(src_file, mode='w', encoding='utf-8') as s_file: FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\zoran\\hrt\\src_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt' 70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o 260 C:\Users\zoran\hrt\src_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt 260 C:\Users\zoran\hrt\des_70._godišnjica_pobjede_nad_fašizmom_Zašto_većina_čelnika_Europske_unije_bojkotira_vojnu_paradu_u_Moskvi__Kako_će_se_obljetnica_pobjede_nad_nacističkom_Njemačkom_i_njenim_satelitima_obilježiti_u_našoj_zemlji__Hoće_li_Josip_Broz_Tito_o.txt Process finished with exit code 1 Cfg file has the following contents: C:\Users\zoran\PycharmProjects\mm_align\hrt3.cfg contents [Dir] ROOTDIR = C:\Users\zoran\hrt > I doubt that the problem is in the ROODIR value, but of course nothing > in your program bothers to check that that directory exists. 
I expect > you either have too many characters total, or the 232th character is a > strange one. Or perhaps title has a backslash in it (you took care of > forward slash). How to determine that? > While we're at it, if you do have an OS limitation on size, your code is > truncating at the wrong point. You need to truncate the title based on > the total size of src_file and dst_file, and since the code cannot know > the size of ROOTDIR, you need to include that in your figuring. Well, in my program I am defining a file name as category-id-description.mp3. If the file is too long I am cutting description (it wasn't clear from my example). Regards. -- https://mail.python.org/mailman/listinfo/python-list
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
> > It works, but if you change title = title[:232] to title = title[:233], > > you will get "FileNotFoundError: [Errno 2] No such file or directory". > > > Which is a *completely different* error from > > SyntaxError: 'unicodeescape' codec can't decode bytes in position 2-3: > truncated \U escape I don't know when the original error disappeared and became this one (I am confused). Regards. -- https://mail.python.org/mailman/listinfo/python-list
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
On Sun, May 10, 2015 at 1:13 AM, Steven D'Aprano wrote: > FileNotFoundError means that the program did run, it tried to open a file, > but the file doesn't exist. Normally it does, at least. Sometimes it means that a *directory* doesn't exist (for instance, you can get this when you try to create a new file, which otherwise wouldn't make sense), and occasionally, Windows will give you rather peculiar errors when weird things go wrong, which may be what's going on here (maximum path length - though that can be overridden by switching to a UNC-style path). Steven's point still stands - very different from SyntaxError - but unfortunately it's not always as simple as the name suggests. Thank you oh so much, Windows. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
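A rough sketch of the path rewrite ChrisA alludes to. It assumes he means the Windows extended-length prefix `\\?\` (with `\\?\UNC\` for network shares); the helper name `to_extended_path` is hypothetical. As far as I know the prefix is only honoured for absolute, backslash-separated paths, so the sketch normalises first. Whether a given Python version and API then accepts the longer path is not something this thread establishes.

```python
import ntpath  # Windows path rules, importable on any platform

EXTENDED_PREFIX = '\\\\?\\'  # the four characters \\?\


def to_extended_path(path):
    """Return path with the extended-length prefix added (hypothetical helper)."""
    if path.startswith(EXTENDED_PREFIX):
        return path  # already prefixed, leave untouched
    if not ntpath.isabs(path):
        raise ValueError('an absolute path is required: %r' % path)
    # Collapse forward slashes and redundant separators into backslashes.
    path = ntpath.normpath(path)
    if path.startswith('\\\\'):
        # \\server\share\... becomes \\?\UNC\server\share\...
        return EXTENDED_PREFIX + 'UNC\\' + path[2:]
    return EXTENDED_PREFIX + path


print(to_extended_path('C:/Users/zoran/hrt/src_title.txt'))
```

The function only manipulates strings, so it can be tried on any operating system; the result matters only when passed to a Windows file API.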
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
On Sat, 9 May 2015 08:31 pm, zljubisic...@gmail.com wrote: > It works, but if you change title = title[:232] to title = title[:233], > you will get "FileNotFoundError: [Errno 2] No such file or directory". Which is a *completely different* error from SyntaxError: 'unicodeescape' codec can't decode bytes in position 2-3: truncated \U escape > As you can see ROOTDIR contains \U. How can I possibly see that? Your code reads ROOTDIR from the config file, which you don't show us. I agree with you that Windows has limitations on the length of file names, and that you get an error if you give a file name that cannot be found. The point is that before you can get that far, you *first* have to fix the SyntaxError. That's a completely different problem. You can't fix the \U syntax error by truncating the total file length. But you can fix that syntax error by changing your code so it reads the ROOTDIR from a config file instead of a hard-coded string literal -- exactly like we told you to do! An essential skill when programming is to read and understand the error messages. One of the most painful things to use is a programming language that just says "An error occurred" with no other explanation. Python gives you lots of detail to explain what went wrong: SyntaxError means you made an error in the syntax of the code and the program cannot even run. FileNotFoundError means that the program did run, it tried to open a file, but the file doesn't exist. They're a little bit different, don't you agree? -- Steven -- https://mail.python.org/mailman/listinfo/python-list
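To make Steven's distinction concrete, here is a small sketch that triggers both errors on purpose. The bad literal is held as program text and passed to `compile()`, so the failure happens at compile time, before anything runs. The second failure happens at run time, when `open()` is asked to create a file inside a directory that does not exist. The directory and file names are made up for the demonstration.

```python
import os
import tempfile

# A one-line program containing the bad literal. Holding it in a raw string
# keeps this example free of the very error it demonstrates.
bad_source = r"ROOTDIR = 'C:\Users\zoran'"

try:
    compile(bad_source, '<demo>', 'exec')  # fails before any code runs
    compile_error = None
except SyntaxError as exc:
    compile_error = exc

with tempfile.TemporaryDirectory() as tmp:
    # The parent directory of this file does not exist, so open() fails
    # while the program is running.
    missing = os.path.join(tmp, 'no_such_dir', 'file.txt')
    try:
        with open(missing, mode='w', encoding='utf-8') as f:
            f.write('test')
        runtime_error = None
    except FileNotFoundError as exc:
        runtime_error = exc

print(type(compile_error).__name__)
print(type(runtime_error).__name__)
```

Note that the second case raises FileNotFoundError even though the file is being created: it is the missing directory that cannot be found.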
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
On 05/09/2015 06:31 AM, zljubisic...@gmail.com wrote: title = title[:232] title = title.replace(" ", "_").replace("/", "_").replace("!", "_").replace("?", "_")\ .replace('"', "_").replace(':', "_").replace(',', "_").replace('"', '')\ .replace('\n', '_').replace(''', '') print(title) src_file = os.path.join(ROOTDIR, 'src_' + title + '.txt') dst_file = os.path.join(ROOTDIR, 'des_' + title + '.txt') print(len(src_file), src_file) print(len(dst_file), dst_file) with open(src_file, mode='w', encoding='utf-8') as s_file: s_file.write('test') shutil.move(src_file, dst_file) It works, but if you change title = title[:232] to title = title[:233], you will get "FileNotFoundError: [Errno 2] No such file or directory". As you can see ROOTDIR contains \U. No, we can't see what ROOTDIR is, since you read it from the config file. And you don't show us the results of those prints. You don't even show us the full exception, or even the line it fails on. I doubt that the problem is in the ROOTDIR value, but of course nothing in your program bothers to check that that directory exists. I expect you either have too many characters total, or the 232nd character is a strange one. Or perhaps title has a backslash in it (you took care of forward slash). While we're at it, if you do have an OS limitation on size, your code is truncating at the wrong point. You need to truncate the title based on the total size of src_file and dst_file, and since the code cannot know the size of ROOTDIR, you need to include that in your figuring. -- DaveA -- https://mail.python.org/mailman/listinfo/python-list
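DaveA's advice about truncating on the total path length can be sketched as follows. The helper name `fit_title` and its parameters are hypothetical, and the 259-character limit is simply the figure quoted elsewhere in this thread, not something the code verifies. `ntpath` is used so that the arithmetic follows Windows joining rules on any platform.

```python
import ntpath  # Windows joining rules, usable on any platform

MAX_PATH_CHARS = 259  # assumed limit, taken from the figure quoted in this thread


def fit_title(rootdir, title, prefixes=('src_', 'des_'), suffix='.txt',
              limit=MAX_PATH_CHARS):
    """Shorten title so that every rootdir/prefix+title+suffix path fits in limit."""
    longest_prefix = max(prefixes, key=len)
    # Length of the full path with an empty title: everything except the title.
    overhead = len(ntpath.join(rootdir, longest_prefix + suffix))
    room = limit - overhead
    if room <= 0:
        raise ValueError('rootdir leaves no room for a title')
    return title[:room]


rootdir = r'C:\Users\zoran\hrt'
title = fit_title(rootdir, 'x' * 400)
src_file = ntpath.join(rootdir, 'src_' + title + '.txt')
print(len(title), len(src_file))
```

With the ROOTDIR shown in this thread the room left for the title works out to 232 characters, which matches the observation that title[:232] works and title[:233] does not.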
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Steven, please do look at the code below: # C:\Users\zoran\PycharmProjects\mm_align\hrt3.cfg contents # [Dir] # ROOTDIR = C:\Users\zoran\hrt import os import shutil import configparser import requests import re Config = configparser.ConfigParser() Config.optionxform = str # preserve case in ini file cfg_file = os.path.join('C:\\Users\\zoran\\PycharmProjects\\mm_align\\hrt3.cfg' ) Config.read(cfg_file) ROOTDIR = Config.get('Dir', 'ROOTDIR') print(ROOTDIR) html = requests.get("http://radio.hrt.hr/prvi-program/arhiva/ujutro-prvi-poligraf-politicki-grafikon/118/").text art_html = re.search('(.+?)', html, re.DOTALL).group(1) for p_tag in re.finditer(r'(.*?)', art_html, re.DOTALL): if '' not in p_tag.group(1): title = p_tag.group(1) title = title[:232] title = title.replace(" ", "_").replace("/", "_").replace("!", "_").replace("?", "_")\ .replace('"', "_").replace(':', "_").replace(',', "_").replace('"', '')\ .replace('\n', '_').replace(''', '') print(title) src_file = os.path.join(ROOTDIR, 'src_' + title + '.txt') dst_file = os.path.join(ROOTDIR, 'des_' + title + '.txt') print(len(src_file), src_file) print(len(dst_file), dst_file) with open(src_file, mode='w', encoding='utf-8') as s_file: s_file.write('test') shutil.move(src_file, dst_file) It works, but if you change title = title[:232] to title = title[:233], you will get "FileNotFoundError: [Errno 2] No such file or directory". As you can see ROOTDIR contains \U. Regards. -- https://mail.python.org/mailman/listinfo/python-list
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
On Sat, 9 May 2015 06:39 am, zljubisic...@gmail.com wrote:

> Thanks for clarifying.
> Looks like the error message was wrong.

No, the error message was right. Your problem was that you used backslashes in *Python program code*, rather than reading the path from a text file. In a Python string literal, \U starts an escape sequence which must be followed by exactly 8 hexadecimal digits:

    py> path = '\U000000a7'
    py> print(path)
    §

If you don't follow the \U with eight hex digits, you get an error:

    py> path = '\Users~~~~'
      File "<stdin>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 4-6: truncated \U escape

This applies only to string literals in code. For data read from files, backslash \ is just an ordinary character which has no special meaning.

> On Windows NTFS I had a file name of more than 259 characters, which is the
> Windows limit. After cutting the file name to 259 characters, everything works
> as it should. If I cut the file name to 260 characters, I get the error from
> the subject, which is wrong.

What you describe is impossible. You cannot possibly get a SyntaxError at compile time because the path is too long. You must have made other changes at the same time, such as using a raw string r'C: ... \Users\ ...'.

-- Steven
-- https://mail.python.org/mailman/listinfo/python-list
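A short sketch of the rule Steven describes: escapes are processed only in string literals, so there is more than one way to write the same path in source code, and a backslash that arrives as data is left alone. The bytes literal below merely stands in for content read from a file; it is an assumption made for the demonstration, not part of the original script.

```python
# Two ways to spell the same Windows path in source code.
escaped = 'C:\\Users\\zoran'  # double every backslash
raw = r'C:\Users\zoran'       # raw string: backslashes are not escapes

# A complete \U escape needs exactly eight hexadecimal digits.
section_sign = '\U000000a7'

# Backslashes that arrive as data (here decoded from bytes, as if read from
# a file) are ordinary characters and are never treated as escapes.
from_data = b'C:\\Users\\zoran'.decode('ascii')

print(escaped == raw == from_data)
print(len(raw))
print(section_sign)
```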
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
On Sat, May 9, 2015 at 5:00 AM, wrote: > But it returns the following error: > > > C:\Python34\python.exe C:/Users/bckslash_test.py > File "C:/Users/bckslash_test.py", line 4 > ROOTDIR = 'C:\Users' > ^ > SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in > position 2-3: truncated \U escape Strong suggestion: Use forward slashes for everything other than what you show to a human - and maybe even then (some programs have always printed stuff out that way - zip/unzip, for instance). The backslash has special meaning in many contexts, and you'll just save yourself so much trouble... ROOTDIR = 'C:/Users/zoran' Problem solved! ChrisA -- https://mail.python.org/mailman/listinfo/python-list
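ChrisA's forward-slash suggestion can be tried without a Windows machine by using `pathlib.PureWindowsPath`, which applies Windows path rules on any platform. This is only a sketch of the idea; the original script used `os.path.join`, not `pathlib`.

```python
from pathlib import PureWindowsPath

# Forward slashes in the literal: no escape sequences to worry about.
root = PureWindowsPath('C:/Users/zoran')
file1 = root / 'abc.txt'

print(file1)             # rendered with backslashes, for showing to a human
print(file1.as_posix())  # the forward-slash spelling of the same path
```

Both spellings compare equal as paths, so code can use forward slashes internally and convert only for display.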
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Thanks for clarifying. Looks like the error message was wrong. On Windows NTFS I had a file name of more than 259 characters, which is the Windows limit. After cutting the file name to 259 characters, everything works as it should. If I cut the file name to 260 characters, I get the error from the subject, which is wrong. Anyway, case closed. Thank you very much; I had suspected that something was wrong with configparser. Best regards. -- https://mail.python.org/mailman/listinfo/python-list
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
On 2015-05-08 20:00, zljubisic...@gmail.com wrote: The script is very simple (abc.txt exists in ROOTDIR directory): import os import shutil ROOTDIR = 'C:\Users\zoran' file1 = os.path.join(ROOTDIR, 'abc.txt') file2 = os.path.join(ROOTDIR, 'def.txt') shutil.move(file1, file2) But it returns the following error: C:\Python34\python.exe C:/Users/bckslash_test.py File "C:/Users/bckslash_test.py", line 4 ROOTDIR = 'C:\Users' ^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \U escape Process finished with exit code 1 As I saw, I could solve the problem by changing line 4 to (small letter "r" before string: ROOTDIR = r'C:\Users\zoran' but that is not an option for me because I am using configparser in order to read the ROOTDIR from underlying cfg file. I need a mechanism to read the path string with single backslashes into a variable, but afterwards to escape every backslash in it. How to do that? If you're reading the path from a file, it's not a problem. Try it! -- https://mail.python.org/mailman/listinfo/python-list
Re: SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
On Fri, May 8, 2015, at 15:00, zljubisic...@gmail.com wrote: > As I saw, I could solve the problem by changing line 4 to (small letter > "r" before string: > ROOTDIR = r'C:\Users\zoran' > > but that is not an option for me because I am using configparser in order > to read the ROOTDIR from underlying cfg file. configparser won't have that problem, since "escaping" is only an issue for python source code. No escaping for backslashes is necessary in files read by configparser. >>> import sys >>> import configparser >>> config = configparser.ConfigParser() >>> config['DEFAULT'] = {'ROOTDIR': r'C:\Users\zoran'} >>> config.write(sys.stdout) [DEFAULT] rootdir = C:\Users\zoran -- https://mail.python.org/mailman/listinfo/python-list
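The reading direction can be checked the same way. The sketch below feeds configparser the text of a config file through `read_string()`, so no file on disk is needed. The section and option names mirror the hrt3.cfg contents shown elsewhere in this thread; the Python string holding the file text doubles its backslashes only because it is itself a literal.

```python
import configparser

# The text a config file would contain. Seen as file content, each path
# separator below is a single backslash.
cfg_text = '[Dir]\nROOTDIR = C:\\Users\\zoran\\hrt\n'

config = configparser.ConfigParser()
config.optionxform = str  # preserve the case of option names
config.read_string(cfg_text)

rootdir = config.get('Dir', 'ROOTDIR')
print(rootdir)
```

The value comes back with its single backslashes intact. With the default interpolation, the character that does need care in a config value is the percent sign, which has to be written as %%.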
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
The script is very simple (abc.txt exists in ROOTDIR directory): import os import shutil ROOTDIR = 'C:\Users\zoran' file1 = os.path.join(ROOTDIR, 'abc.txt') file2 = os.path.join(ROOTDIR, 'def.txt') shutil.move(file1, file2) But it returns the following error: C:\Python34\python.exe C:/Users/bckslash_test.py File "C:/Users/bckslash_test.py", line 4 ROOTDIR = 'C:\Users' ^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \U escape Process finished with exit code 1 As I saw, I could solve the problem by changing line 4 to (small letter "r" before string: ROOTDIR = r'C:\Users\zoran' but that is not an option for me because I am using configparser in order to read the ROOTDIR from underlying cfg file. I need a mechanism to read the path string with single backslashes into a variable, but afterwards to escape every backslash in it. How to do that? -- https://mail.python.org/mailman/listinfo/python-list
Re: API for custom Unicode error handlers
On 10/4/2013 3:35 PM, Serhiy Storchaka wrote: 04.10.13 16:56, Steven D'Aprano wrote: I have some custom Unicode error handlers, and I'm looking for advice on the right API for dealing with them. I'm planning to build this error handler into 3.4 (see http://comments.gmane.org/gmane.comp.python.ideas/21296). Should the module holding the error handlers automatically register them? This question interests me too. I did not respond on the p-i thread, but +1 for 'namereplace' also. Like others, I would prefer auto-registration unless that creates a problem. If it is a problem, perhaps the registry mechanism needs improvement. On the other hand, if it is built in, it will be pre-registered. -- Terry Jan Reedy -- https://mail.python.org/mailman/listinfo/python-list
Re: API for custom Unicode error handlers
04.10.13 16:56, Steven D'Aprano wrote:

> I have some custom Unicode error handlers, and I'm looking for advice on
> the right API for dealing with them.
>
> I have a module containing custom Unicode error handlers. For example:
>
> # Python 3
> import unicodedata
> def namereplace_errors(exc):
>     c = exc.object[exc.start]
>     try:
>         name = unicodedata.name(c)
>     except (KeyError, ValueError):
>         n = ord(c)
>         if n <= 0xFFFF:
>             replace = "\\u%04x"
>         else:
>             assert n <= 0x10FFFF
>             replace = "\\U%08x"
>         replace = replace % n
>     else:
>         replace = "\\N{%s}" % name
>     return replace, exc.start + 1

I'm planning to build this error handler into 3.4 (see http://comments.gmane.org/gmane.comp.python.ideas/21296).

Actually, a Python implementation should look like this:

    def namereplace_errors(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        replace = []
        for c in exc.object[exc.start:exc.end]:
            try:
                replace.append(r'\N{%s}' % unicodedata.name(c))
            except (KeyError, ValueError):
                n = ord(c)
                if n < 0x100:
                    replace.append(r'\x%02x' % n)
                elif n < 0x10000:
                    replace.append(r'\u%04x' % n)
                else:
                    replace.append(r'\U%08x' % n)
        return ''.join(replace), exc.end

> Now, my question:
>
> Should the module holding the error handlers automatically register them?

This question interests me too.
-- https://mail.python.org/mailman/listinfo/python-list
Re: API for custom Unicode error handlers
04.10.13 20:22, Chris Angelico wrote: I'd be quite happy with importing having a side-effect here. If you import a module that implements a numeric type, it should immediately register itself with the Numeric ABC, right? This is IMO equivalent to that. There is a difference. You can't use a numeric type without importing its module, but you can use an error handler that was registered outside of your module. This leads to subtle bugs. Suppose module A imports error_handlers and uses the error handler. Module B uses the error handler but doesn't import error_handlers. C.py imports A and then B, and everything works. D.py imports B and then A, and fails. -- https://mail.python.org/mailman/listinfo/python-list
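The failure Serhiy describes for module B can be seen in isolation: asking for an error handler that nothing has registered yet raises LookupError as soon as the codec needs it. The handler name below is made up and assumed to be unregistered.

```python
# What a module sees if it relies on a handler that some other module was
# supposed to register, and that other module has not been imported yet.
try:
    'abc\u04f1'.encode('ascii', 'handler_nobody_registered')
except LookupError as exc:
    failure = exc
else:
    failure = None

print(type(failure).__name__, failure)
```

The same call starts working once any module in the process has registered the name, which is why the outcome depends on import order.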
Re: API for custom Unicode error handlers
On 10/04/2013 06:56 AM, Steven D'Aprano wrote: Should the module holding the error handlers automatically register them? I think it should. Registration only needs to happen once, the module is useless without being registered, no threads nor processes are being started, and the only reason to import the module is to get the functionality... isn't it? What about help(), sphynx (sp?), or other introspection tools? This sounds similar to cgitb -- another module which you only import if you want the html'ized traceback, and yet it requires a separate cgitb.enable() call... I change my mind, it shouldn't. Throw in a .enable() function and call it good. :) -- ~Ethan~ -- https://mail.python.org/mailman/listinfo/python-list
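Ethan's suggestion, modelled on cgitb.enable(), might look roughly like the sketch below. It assumes a simplified handler and registers it under the made-up name 'namereplace_demo' so that it cannot clash with any handler a given Python version already provides. Nothing is registered at import time; the caller opts in explicitly.

```python
import codecs
import unicodedata

HANDLER_NAME = 'namereplace_demo'  # hypothetical name, chosen to avoid clashes


def namereplace_errors(exc):
    """Replace unencodable characters with their Unicode names (simplified)."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    parts = []
    for c in exc.object[exc.start:exc.end]:
        try:
            parts.append('\\N{%s}' % unicodedata.name(c))
        except ValueError:  # unicodedata.name() found no name for c
            parts.append('\\U%08x' % ord(c))
    return ''.join(parts), exc.end


def enable():
    """Register the handler. Importing this module alone registers nothing."""
    codecs.register_error(HANDLER_NAME, namereplace_errors)


enable()
encoded = 'abc\u04F1'.encode('ascii', HANDLER_NAME)
print(encoded)
```

Calling enable() more than once is harmless in this sketch, since it registers the same function under the same name each time.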
Re: API for custom Unicode error handlers
On Fri, Oct 4, 2013 at 11:56 PM, Steven D'Aprano wrote: > Should the module holding the error handlers automatically register them? > In other words, if I do: > > import error_handlers > > just importing it will have the side-effect of registering the error > handlers. Normally, I dislike imports that have side-effects of this > sort, but I'm not sure that the alternative is better, that is, to put > responsibility on the caller to register some, or all, of the handlers: > > import error_handlers > error_handlers.register(error_handlers.namereplace_errors) > error_handlers.register_all() Caveat: I don't actually use codecs much, so I don't know the specifics. I'd be quite happy with importing having a side-effect here. If you import a module that implements a numeric type, it should immediately register itself with the Numeric ABC, right? This is IMO equivalent to that. > As far as I know, there is no way to find out what error handlers are > registered, and no way to deregister one after it has been registered. The only risk that I see is of an accidental collision. Having a codec registered that you don't use can't hurt (afaik). Is there any mechanism for detecting a name collision? If not, I wouldn't worry about it. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
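On the collision question: codecs.lookup_error() returns the handler registered under a given name and raises LookupError when there is none, so one specific name can be tested before registering. That is not the same as listing every registered handler, which as far as I know the codecs module does not offer. The helper names below are hypothetical.

```python
import codecs


def is_registered(name):
    """Return True if an error handler called name is already registered."""
    try:
        codecs.lookup_error(name)
    except LookupError:
        return False
    return True


def register_unless_taken(name, handler):
    """Register handler only if name is free; return whether it was registered."""
    if is_registered(name):
        return False
    codecs.register_error(name, handler)
    return True


print(is_registered('strict'))
print(register_unless_taken('collision_demo', codecs.replace_errors))
print(register_unless_taken('collision_demo', codecs.ignore_errors))
```

The check and the registration are two separate steps, so this guards against accidental reuse of a name rather than against concurrent registration from several threads.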
API for custom Unicode error handlers
I have some custom Unicode error handlers, and I'm looking for advice on the right API for dealing with them.

I have a module containing custom Unicode error handlers. For example:

    # Python 3
    import unicodedata
    def namereplace_errors(exc):
        c = exc.object[exc.start]
        try:
            name = unicodedata.name(c)
        except (KeyError, ValueError):
            n = ord(c)
            if n <= 0xFFFF:
                replace = "\\u%04x"
            else:
                assert n <= 0x10FFFF
                replace = "\\U%08x"
            replace = replace % n
        else:
            replace = "\\N{%s}" % name
        return replace, exc.start + 1

Before I can use the error handler, I need to register it using this:

    import codecs
    codecs.register_error('namereplace', namereplace_errors)

And now:

    py> 'abc\u04F1'.encode('ascii', 'namereplace')
    b'abc\\N{CYRILLIC SMALL LETTER U WITH DIAERESIS}'

Now, my question:

Should the module holding the error handlers automatically register them? In other words, if I do:

    import error_handlers

just importing it will have the side-effect of registering the error handlers. Normally, I dislike imports that have side-effects of this sort, but I'm not sure that the alternative is better, that is, to put responsibility on the caller to register some, or all, of the handlers:

    import error_handlers
    error_handlers.register(error_handlers.namereplace_errors)
    error_handlers.register_all()

As far as I know, there is no way to find out what error handlers are registered, and no way to deregister one after it has been registered.

Which API would you prefer if you were using this module?

-- Steven
-- https://mail.python.org/mailman/listinfo/python-list
Re: Right solution to unicode error?
On Thursday, 8 November 2012 at 21:42:58 UTC+1, Ian wrote: > On Thu, Nov 8, 2012 at 12:54 PM, wrote: > > > Font has nothing to do here. > > > You are "simply" wrongly encoding your "unicode". > > > > > '\u2013' > > > '–' > > '\u2013'.encode('utf-8') > > > b'\xe2\x80\x93' > > '\u2013'.encode('utf-8').decode('cp1252') > > > '–' > > > > No, it seriously is the font. This is what I get using the default > > ("Raster") font: > > > > C:\>chcp 65001 > > Active code page: 65001 > > > > C:\>c:\python33\python > > Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 > > 32 bit (Intel)] on win32 > > Type "help", "copyright", "credits" or "license" for more information. > > >>> '\u2013' > > '–' > > >>> import sys > > >>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8')) > > – > > 4 > > > > I should note here that the characters copied and pasted do not > > correspond to the glyphs actually displayed in my terminal window. In > > the terminal window I actually see: > > > > ΓÇô > > > > If I change the font to Lucida Console and run the *exact same code*, > > I get this: > > > > C:\>chcp 65001 > > Active code page: 65001 > > > > C:\>c:\python33\python > > Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 > > 32 bit (Intel)] on win32 > > Type "help", "copyright", "credits" or "license" for more information. > > >>> '\u2013' > > '–' > > > > >>> import sys > > >>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8')) > > – > > 4 > > > > Why is the font important? I have no idea. Blame Microsoft. - If you have something like this 'ΓÇô'; in Unicode nomenclature: >>> import unicodedata as ud >>> for c in 'ΓÇô': ... ud.name(c) ... 'GREEK CAPITAL LETTER GAMMA' 'LATIN CAPITAL LETTER C WITH CEDILLA' 'LATIN SMALL LETTER O WITH CIRCUMFLEX' it is a sign of a "cp437" somewhere. >>> '\u2013'.encode('utf-8').decode('cp437') 'ΓÇô' On Windows 7. I do not remember ever having a "coding of the characters" issue on XP. 
jmf -- http://mail.python.org/mailman/listinfo/python-list
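jmf's diagnosis can be turned into a small check: encode the intended character as UTF-8, decode the bytes with each suspect code page, and see which one reproduces the garbage on screen. The function names and the list of candidate code pages are assumptions made for this sketch.

```python
def mojibake(text, wrong_codec, right_codec='utf-8'):
    """Show how text looks when its right_codec bytes are read as wrong_codec."""
    return text.encode(right_codec).decode(wrong_codec)


def guess_wrong_codec(original, garbled,
                      candidates=('cp437', 'cp850', 'cp1252', 'latin-1')):
    """Return the candidate code pages that turn original into garbled."""
    matches = []
    for codec in candidates:
        try:
            if mojibake(original, codec) == garbled:
                matches.append(codec)
        except UnicodeDecodeError:
            continue  # this code page has no character for one of the bytes
    return matches


en_dash = '\u2013'
seen_in_terminal = '\u0393\u00c7\u00f4'  # the three glyphs Ian reported seeing
print(guess_wrong_codec(en_dash, seen_in_terminal))
```

This only identifies which code page the bytes were interpreted with. It says nothing about why cmd.exe chose that interpretation, which is the part of the discussion that depends on the font.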
Re: Right solution to unicode error?
On 2012.11.08 08:06, Oscar Benjamin wrote: > It would be a lot better though if it just worked straight away > without me needing to set the code page (like the terminal in every > other OS I use). The crude equivalent of a .bashrc/.zshrc/whatever shell startup script for cmd is a string value (REG_SZ) named autorun under HKCU\Software\Microsoft\Command Processor, set to whatever command(s) you want to run whenever the shell starts. Mine has a value of '@chcp 65001>nul'. I actually run zsh when practical (gotta love Cygwin) and I have an equivalent command in my .zshrc. Getting Unicode to work in Windows is a hassle, but it /can/ work. CPython does have a bug that makes it annoying at times, though - http://bugs.python.org/issue1602 -- CPython 3.3.0 | Windows NT 6.1.7601.17835 -- http://mail.python.org/mailman/listinfo/python-list
Re: Right solution to unicode error?
On 8 November 2012 19:54, wrote: > Le jeudi 8 novembre 2012 19:49:24 UTC+1, Ian a écrit : >> On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin >> >> wrote: >> >> > If I want the other characters to work I need to change the code page: >> >> > >> >> > O:\>chcp 65001 >> >> > Active code page: 65001 >> >> > >> >> > O:\>Q:\tools\Python33\python -c "import sys; >> >> I find that I also need to change the font. With the default font, >> >> printing '\u2013' gives me: >> >> – >> >> >> >> The only alternative font option I have in Windows XP is Lucida >> >> Console, which at least works correctly, although it seems to be >> >> lacking a lot of glyphs. > > Font has nothing to do here. > You are "simply" wrongly encoding your "unicode". > '\u2013' > '–' '\u2013'.encode('utf-8') > b'\xe2\x80\x93' '\u2013'.encode('utf-8').decode('cp1252') > '–' You have correctly identified that the displayed characters are the result of accidentally interpreting utf-8 bytes as if they were cp1252 or similar. However, it is not Ian or Python that is confusing the encoding. It is cmd.exe that is confusing the encoding in a font-dependent way. I also had to change the font as Ian describes though I did it some time ago and forgot to mention it here. jmf, can you please trim the text you quote removing the parts you are not responding to and then any remaining blank lines that were inserted by your reader/editor? Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: Right solution to unicode error?
On Thu, Nov 8, 2012 at 1:54 PM, Prasad, Ramit wrote: > Why would font not matter? Unicode is the abstract definition > of all characters right? From that we map the abstract > character to a code page/set, which gives real values for an > abstract character. From that code page we then visually display > the "real value" based on the font. If that font does > not have a glyph for a specific character page (or a different > glyph) then that is a problem and not related encoding. Usually though when the font is missing a glyph for a Unicode character, you just get a missing glyph symbol, such as an empty rectangle. For some reason when using the default font, cmd seemingly ignores the active code page, skips decoding the characters, and tries to print the individual bytes as if using code page 437. -- http://mail.python.org/mailman/listinfo/python-list
RE: Right solution to unicode error?
wxjmfa...@gmail.com wrote: > > Le jeudi 8 novembre 2012 19:49:24 UTC+1, Ian a écrit : > > On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin > > > > wrote: > > > > > If I want the other characters to work I need to change the code page: > > > > > > O:\>chcp 65001 > > > Active code page: 65001 > > > > > > O:\>Q:\tools\Python33\python -c "import sys; > > > sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))" > > > α > > > > > > O:\>Q:\tools\Python33\python -c "import sys; > > > sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.en > > > coding))" > > > α > > > > I find that I also need to change the font. With the default font, > > > > printing '\u2013' gives me: > > – > > > > The only alternative font option I have in Windows XP is Lucida > > Console, which at least works correctly, although it seems to be > > lacking a lot of glyphs. > > > > Font has nothing to do here. > You are "simply" wrongly encoding your "unicode". > Why would font not matter? Unicode is the abstract definition of all characters right? From that we map the abstract character to a code page/set, which gives real values for an abstract character. From that code page we then visually display the "real value" based on the font. If that font does not have a glyph for a specific character page (or a different glyph) then that is a problem and not related encoding. Unicode->code page->font > >>> '\u2013' > '–' > >>> '\u2013'.encode('utf-8') > b'\xe2\x80\x93' > >>> '\u2013'.encode('utf-8').decode('cp1252') > '–' > This is a mismatched translation between code pages; not font related but is instead one abstraction "level" up. -- http://mail.python.org/mailman/listinfo/python-list
Re: Right solution to unicode error?
On Thu, Nov 8, 2012 at 12:54 PM, wrote:
> Font has nothing to do here.
> You are "simply" wrongly encoding your "unicode".
>
> >>> '\u2013'
> '–'
> >>> '\u2013'.encode('utf-8')
> b'\xe2\x80\x93'
> >>> '\u2013'.encode('utf-8').decode('cp1252')
> 'â€“'

No, it seriously is the font. This is what I get using the default ("Raster") font:

C:\>chcp 65001
Active code page: 65001

C:\>c:\python33\python
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u2013'
'–'
>>> import sys
>>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
–
4

I should note here that the characters copied and pasted do not correspond to the glyphs actually displayed in my terminal window. In the terminal window I actually see:

ΓÇô

If I change the font to Lucida Console and run the *exact same code*, I get this:

C:\>chcp 65001
Active code page: 65001

C:\>c:\python33\python
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u2013'
'–'
>>> import sys
>>> sys.stdout.buffer.write('\u2013\n'.encode('utf-8'))
–
4

Why is the font important? I have no idea. Blame Microsoft.
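An aside on the three glyphs reported for the raster font. The Python 3 sketch below suggests they are what the UTF-8 bytes of U+2013 look like when each byte is read as code page 437. That the raster-font console falls back to an OEM code page is an assumption here, not something established in this thread, and the helper name `misread` is made up for the illustration.

```python
# Sketch: how the UTF-8 bytes of an en dash look under legacy code pages.
EN_DASH = "\u2013"
utf8_bytes = EN_DASH.encode("utf-8")  # b'\xe2\x80\x93'


def misread(data, codepage):
    """Decode bytes with a code page other than the one used to encode them."""
    return data.decode(codepage)


as_cp437 = misread(utf8_bytes, "cp437")    # 'ΓÇô'
as_cp1252 = misread(utf8_bytes, "cp1252")  # 'â€“'

print(repr(as_cp437), repr(as_cp1252))
```

The bytes written are the same in both of the sessions above; only the interpretation of those bytes on the display side differs.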
Re: Right solution to unicode error?
On Thursday, 8 November 2012 19:49:24 UTC+1, Ian wrote:
> On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin wrote:
> > If I want the other characters to work I need to change the code page:
> >
> > O:\>chcp 65001
> > Active code page: 65001
> >
> > O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> > α
> >
> > O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> > α
>
> I find that I also need to change the font. With the default font,
> printing '\u2013' gives me:
>
> –
>
> The only alternative font option I have in Windows XP is Lucida
> Console, which at least works correctly, although it seems to be
> lacking a lot of glyphs.

Font has nothing to do here. You are "simply" wrongly encoding your "unicode".

>>> '\u2013'
'–'
>>> '\u2013'.encode('utf-8')
b'\xe2\x80\x93'
>>> '\u2013'.encode('utf-8').decode('cp1252')
'â€“'

jmf
Re: Right solution to unicode error?
Le jeudi 8 novembre 2012 19:32:14 UTC+1, Oscar Benjamin a écrit : > On 8 November 2012 15:05, wrote: > > > Le jeudi 8 novembre 2012 15:07:23 UTC+1, Oscar Benjamin a écrit : > > >> On 8 November 2012 00:44, Oscar Benjamin > >> wrote: > > >> > On 7 November 2012 23:51, Andrew Berg wrote: > > >> >> On 2012.11.07 17:27, Oscar Benjamin wrote: > > >> > > >> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not > > >> >>> support unicode > > >> > > >> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help > > >> >> the OP since Python versions below 3.3 don't support cp65001, but I > > >> >> think it's important to point out that the Windows command line system > > >> >> (it is not unique to cmd) does in fact support Unicode. > > >> > > >> > I have tried to use code page 65001 and it didn't work for me even if > > >> > I did use a version of Python (possibly 3.3 alpha) that claimed to > > >> > support it. > > >> > > >> I stand corrected. I've just checked and codepage 65001 does work in > > >> cmd.exe (on this machine): > > >> > > >> O:\>chcp 65001 > > >> Active code page: 65001 > > >> > > >> O:\>Q:\tools\Python33\python -c print('abc\u2013def') > > >> abc-def > > >> > > >> O:\>Q:\tools\Python33\python -c print('\u03b1') > > >> α > > >> > > >> It would be a lot better though if it just worked straight away > > >> without me needing to set the code page (like the terminal in every > > >> other OS I use). > > > > > > It *WORKS* straight away. The problem is that > > > people do not wish to use unicode correctly > > > (eg. Mulder's example). > > > Read the point 1) and 4) in my previous post. > > > > > > Unicode and in general the coding of the characters > > > have nothing to do with the os's or programming languages. > > > > I don't know what you mean that it works "straight away". > > > > The default code page on my machine is cp850. > > > > O:\>chcp > > Active code page: 850 > > > > cp850 doesn't understand utf-8. 
> It just prints garbage:
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> ╬▒
>
> Using the correct encoding doesn't help:
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('cp850'))"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in position 0: character maps to <undefined>
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in position 0: character maps to <undefined>
>
> If I want the other characters to work I need to change the code page:
>
> O:\>chcp 65001
> Active code page: 65001
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> α
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> α
>
> Oscar

You are confusing two things: the coding of the characters, and the set of characters (glyphs/graphemes) of a coding scheme. It is always possible to safely encode a unicode string, but the target coding may not contain the character.

Take a look at the output of this "special" interactive interpreter, where the host coding (sys.stdout.encoding) can be changed on the fly.

>>> s = 'éléphant\u2013abc需'
>>> sys.stdout.encoding
''
>>> s
'éléphant–abc需'
>>>
>>> sys.stdout.encoding = 'cp1252'
>>> s.encode('cp1252')
'éléphant–abc需'
>>> sys.stdout.encoding = 'cp850'
>>> s.encode('cp850')
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python32\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 8: character maps to <undefined>
>>> # but
>>> s.encode('cp850', 'replace')
'éléphant?abcé??'
>>>
>>> sys.stdout.encoding = 'utf-8'
>>> s
'éléphant–abc需'
>>> s.encode('utf-8')
'éléphant–abc需'
>>>
>>> sys.stdout.encoding = 'utf-16-le' <
>>> s
' é l é p h a n t a b c é S ¬ '
>>> s.encode('utf-16-le')
'éléphant–abc需'

<<< Some cheating here due to the mail system; it really looks like this.

jmf
Re: Right solution to unicode error?
On Thu, Nov 8, 2012 at 11:32 AM, Oscar Benjamin wrote:
> If I want the other characters to work I need to change the code page:
>
> O:\>chcp 65001
> Active code page: 65001
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
> α
>
> O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
> α

I find that I also need to change the font. With the default font, printing '\u2013' gives me:

–

The only alternative font option I have in Windows XP is Lucida Console, which at least works correctly, although it seems to be lacking a lot of glyphs.
Re: Right solution to unicode error?
On 8 November 2012 15:05, wrote:
> On Thursday, 8 November 2012 15:07:23 UTC+1, Oscar Benjamin wrote:
>> On 8 November 2012 00:44, Oscar Benjamin wrote:
>> > On 7 November 2012 23:51, Andrew Berg wrote:
>> >> On 2012.11.07 17:27, Oscar Benjamin wrote:
>> >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not
>> >>> support unicode
>> >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help
>> >> the OP since Python versions below 3.3 don't support cp65001, but I
>> >> think it's important to point out that the Windows command line system
>> >> (it is not unique to cmd) does in fact support Unicode.
>> >
>> > I have tried to use code page 65001 and it didn't work for me even if
>> > I did use a version of Python (possibly 3.3 alpha) that claimed to
>> > support it.
>>
>> I stand corrected. I've just checked and codepage 65001 does work in
>> cmd.exe (on this machine):
>>
>> O:\>chcp 65001
>> Active code page: 65001
>>
>> O:\>Q:\tools\Python33\python -c print('abc\u2013def')
>> abc-def
>>
>> O:\>Q:\tools\Python33\python -c print('\u03b1')
>> α
>>
>> It would be a lot better though if it just worked straight away
>> without me needing to set the code page (like the terminal in every
>> other OS I use).
>
> It *WORKS* straight away. The problem is that
> people do not wish to use unicode correctly
> (eg. Mulder's example).
> Read the point 1) and 4) in my previous post.
>
> Unicode and in general the coding of the characters
> have nothing to do with the os's or programming languages.

I don't know what you mean that it works "straight away".

The default code page on my machine is cp850.

O:\>chcp
Active code page: 850

cp850 doesn't understand utf-8. It just prints garbage:

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
╬▒

Using the correct encoding doesn't help:

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('cp850'))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in position 0: character maps to <undefined>

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "Q:\tools\Python33\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\u03b1' in position 0: character maps to <undefined>

If I want the other characters to work I need to change the code page:

O:\>chcp 65001
Active code page: 65001

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode('utf-8'))"
α

O:\>Q:\tools\Python33\python -c "import sys; sys.stdout.buffer.write('\u03b1\n'.encode(sys.stdout.encoding))"
α

Oscar
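The cp850 results above can be reproduced without a Windows console. The following is a minimal Python 3 sketch, not the approach anyone in the thread settled on: `encode_for_console` is a hypothetical helper that tries a strict encode and falls back to `errors="replace"` when the code page lacks the character.

```python
def encode_for_console(text, encoding):
    """Encode text for a console, substituting '?' where the code page has no mapping."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # The target code page does not contain the character at all.
        return text.encode(encoding, errors="replace")


alpha = "\u03b1\n"
as_cp850 = encode_for_console(alpha, "cp850")  # alpha is not in cp850
as_utf8 = encode_for_console(alpha, "utf-8")   # always representable

# The "garbage" case: UTF-8 bytes shown by a console that assumes cp850.
garbage = "\u03b1".encode("utf-8").decode("cp850")

print(as_cp850, as_utf8, repr(garbage))
```

Replacing the character loses information, so this only suits output meant for a human to glance at; changing the code page, as shown above, is what actually gets the character displayed.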
Re: Right solution to unicode error?
Le jeudi 8 novembre 2012 15:07:23 UTC+1, Oscar Benjamin a écrit : > On 8 November 2012 00:44, Oscar Benjamin wrote: > > > On 7 November 2012 23:51, Andrew Berg wrote: > > >> On 2012.11.07 17:27, Oscar Benjamin wrote: > > >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not > > >>> support unicode > > >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help > > >> the OP since Python versions below 3.3 don't support cp65001, but I > > >> think it's important to point out that the Windows command line system > > >> (it is not unique to cmd) does in fact support Unicode. > > > > > > I have tried to use code page 65001 and it didn't work for me even if > > > I did use a version of Python (possibly 3.3 alpha) that claimed to > > > support it. > > > > I stand corrected. I've just checked and codepage 65001 does work in > > cmd.exe (on this machine): > > > > O:\>Q:\tools\Python33\python -c print('abc\u2013def') > > Traceback (most recent call last): > > File "", line 1, in > > File "Q:\tools\Python33\lib\encodings\cp850.py", line 19, in encode > > return codecs.charmap_encode(input,self.errors,encoding_map)[0] > > UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in > > position 3: character maps to > > > > > > O:\>chcp 65001 > > Active code page: 65001 > > > > O:\>Q:\tools\Python33\python -c print('abc\u2013def') > > abc-def > > > > > > O:\>Q:\tools\Python33\python -c print('\u03b1') > > α > > > > It would be a lot better though if it just worked straight away > > without me needing to set the code page (like the terminal in every > > other OS I use). > > > > > > Oscar -- It *WORKS* straight away. The problem is that people do not wish to use unicode correctly (eg. Mulder's example). Read the point 1) and 4) in my previous post. Unicode and in general the coding of the characters have nothing to do with the os's or programming languages. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: Right solution to unicode error?
On 8 November 2012 00:44, Oscar Benjamin wrote: > On 7 November 2012 23:51, Andrew Berg wrote: >> On 2012.11.07 17:27, Oscar Benjamin wrote: >>> Are you using cmd.exe (standard Windows terminal)? If so, it does not >>> support unicode >> Actually, it does. Code page 65001 is UTF-8. I know that doesn't help >> the OP since Python versions below 3.3 don't support cp65001, but I >> think it's important to point out that the Windows command line system >> (it is not unique to cmd) does in fact support Unicode. > > I have tried to use code page 65001 and it didn't work for me even if > I did use a version of Python (possibly 3.3 alpha) that claimed to > support it. I stand corrected. I've just checked and codepage 65001 does work in cmd.exe (on this machine): O:\>Q:\tools\Python33\python -c print('abc\u2013def') Traceback (most recent call last): File "", line 1, in File "Q:\tools\Python33\lib\encodings\cp850.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 3: character maps to O:\>chcp 65001 Active code page: 65001 O:\>Q:\tools\Python33\python -c print('abc\u2013def') abc-def O:\>Q:\tools\Python33\python -c print('\u03b1') α It would be a lot better though if it just worked straight away without me needing to set the code page (like the terminal in every other OS I use). Oscar -- http://mail.python.org/mailman/listinfo/python-list
RE: Right solution to unicode error?
Thanks, Oscar and Ramit! This is exactly what I was looking for.

Anders

> -----Original Message-----
> From: Oscar Benjamin [mailto:oscar.j.benja...@gmail.com]
> Sent: Wednesday, November 07, 2012 6:27 PM
> To: Anders Schneiderman
> Cc: python-list@python.org
> Subject: Re: Right solution to unicode error?
>
> On 7 November 2012 22:17, Anders wrote:
> >
> > Traceback (most recent call last):
> >   File "outlook_tasks.py", line 66, in
> >     my_tasks.dump_today_tasks()
> >   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
> >     print task.subject
> > UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> > position 42: ordinal not in range(128)
> >
> > Here's where I'm getting stuck. In the code above I was just printing
> > the subject so I can see whether the script is working properly.
> > Ultimately what I want to do is parse the tasks I'm interested in and
> > then create an HTML file containing those tasks. Given that, what's
> > the best way to fix this problem?
>
> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> support unicode and Python is telling you that it cannot encode the
> string in a way that can be understood by your terminal. You can try
> using chcp to set the code page to something that works with your script.
>
> If you are only printing it for debugging purposes you can just print
> the repr() of the string which will be ascii and will come out fine in
> your terminal. If you want to write it to a html file you should encode
> the string with whatever encoding (probably utf-8) you use in the html
> file. If you really just want your script to be able to print unicode
> characters then you need to use something other than cmd.exe (such as IDLE).
>
> Oscar
Re: Right solution to unicode error?
On 8/11/12 00:53:49, Steven D'Aprano wrote:
> This error confuses me. Is that an exact copy and paste of the error, or
> have you edited it or reconstructed it? Because it seems to me that if
> task.subject is a unicode string, as it appears to be, calling print on
> it should succeed:
>
> py> s = u'ABC\u2013DEF'
> py> print s
> ABC–DEF

That would depend on whether Python thinks sys.stdout can handle UTF-8. For example, on my MacOS X box:

$ python2.6 -c 'print u"abc\u2013def"'
abc–def
$ python2.6 -c 'print u"abc\u2013def"' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 3: ordinal not in range(128)

This is because Python knows that my terminal is capable of handling UTF-8, but it has no idea whether the program at the other end of a pipe has that ability, so it falls back to ASCII whenever sys.stdout goes to a pipe.

Apparently the OP has a terminal that doesn't handle UTF-8, or one that Python doesn't know about.

Hope this helps,

-- HansM
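One way to take the guesswork out of the pipe case is to choose the output encoding explicitly. The sketch below is for Python 3 (the session above is Python 2.6, where the details differ); the names `text_writer` and `stdout_for_pipes` are invented for the example, and defaulting to UTF-8 with `backslashreplace` is an assumption about what the consumer of the pipe wants.

```python
import io
import sys


def text_writer(binary_stream, encoding="utf-8", errors="backslashreplace"):
    """Wrap a binary stream so that str written to it is encoded explicitly."""
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors=errors,
        newline="\n",        # do not translate '\n' to the platform line ending
        write_through=True,  # pass writes straight through to the binary stream
    )


def stdout_for_pipes():
    """Return sys.stdout on a terminal, or an explicit UTF-8 writer otherwise."""
    if sys.stdout.isatty():
        return sys.stdout
    return text_writer(sys.stdout.buffer)


# Demonstration against an in-memory stream instead of a real pipe.
buf = io.BytesIO()
writer = text_writer(buf)
writer.write("abc\u2013def\n")
writer.flush()
print(buf.getvalue())
```

Setting the `PYTHONIOENCODING` environment variable before starting the interpreter is another existing way to override the encoding Python picks for its standard streams.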
Re: Right solution to unicode error?
On Wednesday, 7 November 2012 23:17:42 UTC+1, Anders wrote:
> I've run into a Unicode error, and despite doing some googling, I
> can't figure out the right way to fix it. I have a Python 2.6 script
> that reads my Outlook 2010 task list. I'm able to read the tasks from
> Outlook and store them as a list of objects without a hitch. But when
> I try to print the tasks' subjects, one of the tasks is generating an
> error:
>
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in
>     my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
>     print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)
>
> (where task.subject was previously assigned the value of
> task.Subject, aka the Subject property of an Outlook 2010 TaskItem)
>
> From what I understand from reading online, the error is telling me
> that the subject line contains an en dash and that Python is trying
> to convert to ascii and failing (as it should).
>
> Here's where I'm getting stuck. In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks. Given that, what's
> the best way to fix this problem?
>
> BTW, if there's a clear description of the best solution for this
> particular problem – i.e., where I want to ultimately display the
> results as HTML – please feel free to refer me to the link. I tried
> reading a number of docs on the web but still feel pretty lost.
>
> Thanks,
> Anders

The problem is not on the Python side or specific to Python. It is on the side of the "coding of characters".

1) Unicode is an abstract entity; it has to be encoded for the system/device that will host it. Using Python:
   .encode(host_coding)

2) The host_coding scheme may not contain the character (glyph/grapheme) corresponding to the "unicode character". In that case there are 2 possible solutions: "ignore" it or "replace" it with a substitution character. Using Python:
   .encode(host_coding, "ignore")
   .encode(host_coding, "replace")

3) Detecting the host_coding is the most difficult task. Either you have to hard-code it or you can let Python find it via sys.stdout.encoding.

4) Due to the nature of unicode, this is the only way to do it correctly.

Expectedly failing and not failing examples. Mainly Py3, but it doesn't matter. Note: Py3 encodes and creates a byte string, which has to be decoded to produce a native (unicode) string, here with cp1252.

Py2

>>> u'éléphant\u2013abc'.encode('ascii')
Traceback (most recent call last):
  File "", line 1, in
    u'éléphant\u2013abc'.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> print(u'éléphant\u2013abc'.encode('cp1252'))
éléphant–abc
>>>

Py3

>>> 'éléphant\u2013abc'.encode('ascii')
Traceback (most recent call last):
  File "", line 1, in
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore')
b'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace')
b'?l?phant?abc'
>>> 'éléphant\u2013abc'.encode('ascii', 'ignore').decode('cp1252')
'lphantabc'
>>> 'éléphant\u2013abc'.encode('ascii', 'replace').decode('cp1252')
'?l?phant?abc'
>>>
>>> 'éléphant\u2013abc'.encode('cp1252').decode('cp1252')
'éléphant–abc'
>>> sys.stdout.encoding
'cp1252'
>>> 'éléphant\u2013abc'.encode(sys.stdout.encoding).decode('cp1252')
'éléphant–abc'

etc

jmf
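Besides "ignore" and "replace", the standard library ships two further encode-side error handlers that keep the information instead of dropping it. A short Python 3 sketch comparing all four on the same string (the variable names are only for illustration):

```python
text = "éléphant\u2013abc"

# Encode to ASCII with each of the standard lossy or escaping error handlers.
handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {name: text.encode("ascii", name) for name in handlers}

for name in handlers:
    print("{:>18}: {!r}".format(name, results[name]))
```

For output that ends up in HTML, `xmlcharrefreplace` is often the convenient choice, since a browser turns the numeric references back into the original characters.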
Re: Right solution to unicode error?
On 7 November 2012 23:51, Andrew Berg wrote: > On 2012.11.07 17:27, Oscar Benjamin wrote: >> Are you using cmd.exe (standard Windows terminal)? If so, it does not >> support unicode > Actually, it does. Code page 65001 is UTF-8. I know that doesn't help > the OP since Python versions below 3.3 don't support cp65001, but I > think it's important to point out that the Windows command line system > (it is not unique to cmd) does in fact support Unicode. I have tried to use code page 65001 and it didn't work for me even if I did use a version of Python (possibly 3.3 alpha) that claimed to support it. It turned out that there were other Windows related problems with using the codepage so that I had to do something like chcp 65001 && python myscript.py && chcp 2521 (It was important for all those commands to be on the same line) I'm not on Windows right now and I can't remember all the details but I seem to remember that even with that awkwardness and changing the font it still didn't actually work. If you know how to make it work, I'd be interested to know. Oscar -- http://mail.python.org/mailman/listinfo/python-list
Re: Right solution to unicode error?
On Wed, 07 Nov 2012 14:17:42 -0800, Anders wrote:
> I've run into a Unicode error, and despite doing some googling, I can't
> figure out the right way to fix it. I have a Python 2.6 script that
> reads my Outlook 2010 task list. I'm able to read the tasks from Outlook
> and store them as a list of objects without a hitch. But when I try to
> print the tasks' subjects, one of the tasks is generating an error:
>
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in
>     my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
>     print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)

This error confuses me. Is that an exact copy and paste of the error, or have you edited it or reconstructed it? Because it seems to me that if task.subject is a unicode string, as it appears to be, calling print on it should succeed:

py> s = u'ABC\u2013DEF'
py> print s
ABC–DEF

What does type(task.subject) return?

-- Steven
Re: Right solution to unicode error?
On 2012.11.07 17:27, Oscar Benjamin wrote:
> Are you using cmd.exe (standard Windows terminal)? If so, it does not
> support unicode

Actually, it does. Code page 65001 is UTF-8. I know that doesn't help the OP since Python versions below 3.3 don't support cp65001, but I think it's important to point out that the Windows command line system (it is not unique to cmd) does in fact support Unicode.

-- CPython 3.3.0 | Windows NT 6.1.7601.17835
Re: Right solution to unicode error?
On 7 November 2012 22:17, Anders wrote:
>
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in
>     my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
>     print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)
>
> Here's where I'm getting stuck. In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks. Given that, what's
> the best way to fix this problem?

Are you using cmd.exe (standard Windows terminal)? If so, it does not support unicode and Python is telling you that it cannot encode the string in a way that can be understood by your terminal. You can try using chcp to set the code page to something that works with your script.

If you are only printing it for debugging purposes you can just print the repr() of the string which will be ascii and will come out fine in your terminal. If you want to write it to a html file you should encode the string with whatever encoding (probably utf-8) you use in the html file. If you really just want your script to be able to print unicode characters then you need to use something other than cmd.exe (such as IDLE).

Oscar
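The two suggestions above (an ASCII-safe representation for debugging, UTF-8 for the HTML file) might look roughly like this. It is a sketch for Python 3, whereas the original script is Python 2.6; `write_tasks_html`, the sample subject, and the file name are all hypothetical.

```python
import html
import io
import os
import tempfile


def write_tasks_html(subjects, path):
    """Write the task subjects as a UTF-8 encoded HTML list."""
    items = "\n".join("<li>{}</li>".format(html.escape(s)) for s in subjects)
    page = (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Tasks</title></head>\n'
        "<body><ul>\n" + items + "\n</ul></body></html>\n"
    )
    # The encoding is stated explicitly and matches the <meta charset> above.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(page)


subjects = ["Budget review \u2013 Q4"]  # made-up subject containing an en dash

# Debugging output that is safe on any terminal: ascii() escapes non-ASCII.
print(ascii(subjects[0]))

out_path = os.path.join(tempfile.mkdtemp(), "tasks.html")
write_tasks_html(subjects, out_path)
```

The terminal never has to display the en dash this way: it only sees the escaped form, while the file on disk carries the real character as UTF-8 bytes.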
RE: Right solution to unicode error?
Anders wrote:
>
> I've run into a Unicode error, and despite doing some googling, I
> can't figure out the right way to fix it. I have a Python 2.6 script
> that reads my Outlook 2010 task list. I'm able to read the tasks from
> Outlook and store them as a list of objects without a hitch. But when
> I try to print the tasks' subjects, one of the tasks is generating an
> error:
>
> Traceback (most recent call last):
>   File "outlook_tasks.py", line 66, in
>     my_tasks.dump_today_tasks()
>   File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
>     print task.subject
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 42: ordinal not in range(128)
>
> (where task.subject was previously assigned the value of
> task.Subject, aka the Subject property of an Outlook 2010 TaskItem)
>
> From what I understand from reading online, the error is telling me
> that the subject line contains an en dash and that Python is trying
> to convert to ascii and failing (as it should).
>
> Here's where I'm getting stuck. In the code above I was just printing
> the subject so I can see whether the script is working properly.
> Ultimately what I want to do is parse the tasks I'm interested in and
> then create an HTML file containing those tasks. Given that, what's
> the best way to fix this problem?
>
> BTW, if there's a clear description of the best solution for this
> particular problem - i.e., where I want to ultimately display the
> results as HTML - please feel free to refer me to the link. I tried
> reading a number of docs on the web but still feel pretty lost.
>

You can always encode in a non-ASCII codec: `print task.subject.encode(<encoding>)`, where <encoding> is something that supports the characters you want, e.g. utf-8 or cp1252 (latin1 has no en dash).

The list of built-in codecs can be found at: http://docs.python.org/library/codecs.html#standard-encodings

~Ramit
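Whether a given codec "supports the characters you want" can be checked directly rather than looked up. A small Python 3 sketch; `can_encode` and the sample subject are made up for the example:

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


subject = "Budget review \u2013 Q4"  # hypothetical subject with an en dash

support = {
    enc: can_encode(subject, enc)
    for enc in ("ascii", "latin-1", "cp1252", "utf-8")
}
print(support)
```

Latin-1 stops at U+00FF, so the en dash (U+2013) is outside it, while cp1252 places the en dash at byte 0x96 and UTF-8 can represent any character.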
Right solution to unicode error?
I've run into a Unicode error, and despite doing some googling, I can't figure out the right way to fix it. I have a Python 2.6 script that reads my Outlook 2010 task list. I'm able to read the tasks from Outlook and store them as a list of objects without a hitch. But when I try to print the tasks' subjects, one of the tasks is generating an error:

Traceback (most recent call last):
  File "outlook_tasks.py", line 66, in
    my_tasks.dump_today_tasks()
  File "C:\Users\Anders\code\Task List\tasks.py", line 29, in dump_today_tasks
    print task.subject
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 42: ordinal not in range(128)

(where task.subject was previously assigned the value of task.Subject, aka the Subject property of an Outlook 2010 TaskItem)

From what I understand from reading online, the error is telling me that the subject line contains an en dash and that Python is trying to convert to ascii and failing (as it should).

Here's where I'm getting stuck. In the code above I was just printing the subject so I can see whether the script is working properly. Ultimately what I want to do is parse the tasks I'm interested in and then create an HTML file containing those tasks. Given that, what's the best way to fix this problem?

BTW, if there's a clear description of the best solution for this particular problem – i.e., where I want to ultimately display the results as HTML – please feel free to refer me to the link. I tried reading a number of docs on the web but still feel pretty lost.

Thanks,
Anders
Re: Why are some unicode error handlers "encode only"?
On 3/11/2012 10:37 AM, Steven D'Aprano wrote:
> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this?

I presume the purpose of both is to facilitate transmission of unicode text via byte transmission by extending incomplete byte encodings, replacing unicode chars that do not fit in the given encoding with an ascii byte sequence that will fit.

> I don't see why they shouldn't work for decoding as well.
> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>   File "", line 1, in
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?
>
> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> => r'aaa--騷--\xe9\x21--bbb'

This output does not round-trip and would be a bit of a fib since it somewhat misrepresents what the encoded bytes were:

>>> r'aaa--騷--\xe9\x21--bbb'.encode("cp932")
b'aaa--\xe9z--\\xe9\\x21--bbb'
>>> b'aaa--\xe9z--\\xe9\\x21--bbb'.decode("cp932")
'aaa--騷--\\xe9\\x21--bbb'

Python 3 added surrogateescape error handling to solve this problem.

> and similarly for xmlcharrefreplace.

Since xml character references are representations of unicode chars, and not bytes, I do not see how that would work. By analogy, perhaps you mean to have 'e9;' in your output instead of '\xe9\x21', but those would not properly be xml numeric character references.

-- Terry Jan Reedy
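To make the round-trip point concrete, here is a minimal Python 3 sketch of the `surrogateescape` handler mentioned above. It uses the ASCII codec rather than cp932 to keep the example small; the byte string is invented for the demonstration.

```python
raw = b"aaa--\xe9!--bbb"  # 0xE9 is not valid ASCII

# Undecodable bytes are smuggled through as lone surrogates (U+DC80..U+DCFF).
text = raw.decode("ascii", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("ascii", "surrogateescape")

print(ascii(text), round_tripped == raw)
```

Unlike a backslash-escaped string, the result is unambiguous: the escaped byte cannot be confused with literal backslash text that happened to be in the input.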
Re: Why are some unicode error handlers "encode only"?
On 11.03.12 15:37, Steven D'Aprano wrote:
> At least two standard error handlers are documented as working for
> encoding only:
>
> xmlcharrefreplace
> backslashreplace
>
> See http://docs.python.org/library/codecs.html#codec-base-classes
>
> and http://docs.python.org/py3k/library/codecs.html
>
> Why is this? I don't see why they shouldn't work for decoding as well.

Because xmlcharrefreplace and backslashreplace are *error* handlers. However the byte sequence b'&#12345;' does *not* contain any bytes that are not decodable for e.g. the ASCII codec. So there are no errors to handle.

> Consider this example using Python 3.2:
>
> >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932")
> Traceback (most recent call last):
>   File "", line 1, in
> UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10:
> illegal multibyte sequence
>
> The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932 (also
> known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't
> or can't be supported?

The byte sequence b'\xe9!' however is not something that would have been produced by the backslashreplace error handler. b'\\xe9!' (a sequence containing 5 bytes) would have been (and this probably would decode without any problems with the cp932 codec).

> # This doesn't actually work.
> b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace")
> => r'aaa--騷--\xe9\x21--bbb'
>
> and similarly for xmlcharrefreplace.

This would require a postprocess step *after* the bytes have been decoded. This is IMHO out of scope for Python's codec machinery.

Servus,
Walter
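For readers who want the decode-side behaviour asked about in this thread, `codecs.register_error` allows a custom handler to be installed. The sketch below is one possible Python 3 implementation, not something proposed by the posters; the handler name "hexescape_decode" and the function name are invented, and the demonstration uses the ASCII codec.

```python
import codecs


def hex_escape_on_decode(exc):
    """Replace each undecodable byte with a literal \\xNN escape."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this handler only deals with decoding errors
    bad = exc.object[exc.start:exc.end]
    replacement = "".join("\\x{:02x}".format(b) for b in bad)
    # Return the replacement text and the position at which to resume decoding.
    return replacement, exc.end


codecs.register_error("hexescape_decode", hex_escape_on_decode)

decoded = b"caf\xe9!".decode("ascii", "hexescape_decode")
print(decoded)
```

As noted in the replies above, the result does not round-trip. Later Python releases (3.5 onward) also accept the built-in `backslashreplace` handler when decoding.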
Why are some unicode error handlers "encode only"?
At least two standard error handlers are documented as working for encoding only: xmlcharrefreplace backslashreplace See http://docs.python.org/library/codecs.html#codec-base-classes and http://docs.python.org/py3k/library/codecs.html Why is this? I don't see why they shouldn't work for decoding as well. Consider this example using Python 3.2: >>> b"aaa--\xe9z--\xe9!--bbb".decode("cp932") Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'cp932' codec can't decode bytes in position 9-10: illegal multibyte sequence The two bytes b'\xe9!' is an illegal multibyte sequence for CP-932 (also known as MS-KANJI or SHIFT-JIS). Is there some reason why this shouldn't or can't be supported? # This doesn't actually work. b"aaa--\xe9z--\xe9!--bbb".decode("cp932", "backslashreplace") => r'aaa--騷--\xe9\x21--bbb' and similarly for xmlcharrefreplace. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error in sax parser
On Tue, Feb 8, 2011 at 5:41 PM, Chris Rebert wrote: > On Tue, Feb 8, 2011 at 7:57 AM, Rickard Lindberg wrote: >> Hi, >> >> Here is a bash script to reproduce my error: > > Including the error message and traceback is still helpful, for future > reference. > >> #!/bin/sh >> >> cat > å.timeline < >> EOF >> >> python <> # encoding: utf-8 >> from xml.sax import parse >> from xml.sax.handler import ContentHandler >> parse(u"å.timeline", ContentHandler()) >> EOF >> >> If I instead do >> >> parse(u"å.timeline".encode("utf-8"), ContentHandler()) >> >> the script runs without errors. >> >> Is this a bug or expected behavior? > > Bug; open() figures out the filesystem encoding just fine. > Bug tracker to report the issue to: http://bugs.python.org/ > > Workaround: > parse(open(u"å.timeline", 'r'), ContentHandler()) > > Cheers, > Chris Bug reported at http://bugs.python.org/issue11159 -- Rickard Lindberg -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error in sax parser
Rickard Lindberg, 09.02.2011 14:01: Did you read my reply? Sorry, it was me who failed to read your question properly. Unicode file names aren't really working well, especially not in Py2.x. Python 3.2 provides many improvements here. I assume your file system encoding is UTF-8? What does sys.getfilesystemencoding() give you? My getfilesystemencoding() returns utf-8. Ok, same here. I tried it with Python 3.1.2 and it works for me. So I think the right work-around for you in Python 2 is to encode the file name using whatever "sys.getfilesystemencoding()" returns. And I agree with Chris Rebert that you should open a bug against the sax package in Python 2.7 on the bug tracker. Stefan -- http://mail.python.org/mailman/listinfo/python-list
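The workaround suggested above (encode the file name with whatever `sys.getfilesystemencoding()` returns) might look roughly like this. `fs_encode` is a hypothetical helper name, not something from the thread or from xml.sax, and the fallback to UTF-8 is an assumption for platforms where no encoding is reported.

```python
import sys


def fs_encode(name):
    """Encode a text file name using the file system encoding (hypothetical helper)."""
    # Assumption: fall back to UTF-8 if the platform reports no encoding.
    encoding = sys.getfilesystemencoding() or "utf-8"
    return name.encode(encoding)


# The byte string is what would be handed to parse() on Python 2.
encoded_name = fs_encode(u"å.timeline")
print(repr(encoded_name))
```

On Python 3 the standard library already offers `os.fsencode()` for the same purpose, so a helper like this is mainly of interest for Python 2 code.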
Re: Unicode error in sax parser
>> Did you read my reply? > >Sorry, it was me who failed to read your question properly. > >Unicode file names aren't really working well, especially not in Py2.x. >Python 3.2 provides many improvements here. > >I assume your file system encoding is UTF-8? What does >sys.getfilesystemencoding() give you? > >Stefan Since I'm not registered on the Python mailing list I had some trouble replying to your message. My getfilesystemencoding() returns utf-8. -- Rickard Lindberg -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error in sax parser
Stefan Behnel, 09.02.2011 09:58: Rickard Lindberg, 09.02.2011 09:32: On Tue, Feb 8, 2011 at 5:41 PM, Chris Rebert wrote: Here is a bash script to reproduce my error: Including the error message and traceback is still helpful, for future reference. Thanks for pointing it out. #!/bin/sh cat> å.timeline< EOF python< Bug; open() figures out the filesystem encoding just fine. Bug tracker to report the issue to: http://bugs.python.org/ Workaround: parse(open(u"å.timeline", 'r'), ContentHandler()) When I tried your workaround, I still got this error: Traceback (most recent call last): File "", line 4, in File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py", line 31, in parse parser.parse(filename_or_stream) File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse xmlreader.IncrementalParser.parse(self, source) File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py", line 119, in parse self.prepareParser(source) File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 121, in prepareParser self._parser.SetBase(source.getSystemId()) UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128) The open(..) part works fine, but there still seems to be a problem inside the sax parser. Did you read my reply? Sorry, it was me who failed to read your question properly. Unicode file names aren't really working well, especially not in Py2.x. Python 3.2 provides many improvements here. I assume your file system encoding is UTF-8? What does sys.getfilesystemencoding() give you? Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error in sax parser
Rickard Lindberg, 09.02.2011 09:32: On Tue, Feb 8, 2011 at 5:41 PM, Chris Rebert wrote: Here is a bash script to reproduce my error: Including the error message and traceback is still helpful, for future reference. Thanks for pointing it out. #!/bin/sh cat> å.timeline< EOF python< Bug; open() figures out the filesystem encoding just fine. Bug tracker to report the issue to: http://bugs.python.org/ Workaround: parse(open(u"å.timeline", 'r'), ContentHandler()) When I tried your workaround, I still got this error: Traceback (most recent call last): File "", line 4, in File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py", line 31, in parse parser.parse(filename_or_stream) File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse xmlreader.IncrementalParser.parse(self, source) File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py", line 119, in parse self.prepareParser(source) File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 121, in prepareParser self._parser.SetBase(source.getSystemId()) UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128) The open(..) part works fine, but there still seems to be a problem inside the sax parser. Did you read my reply? Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error in sax parser
On Tue, Feb 8, 2011 at 5:41 PM, Chris Rebert wrote: >> Here is a bash script to reproduce my error: > > Including the error message and traceback is still helpful, for future > reference. Thanks for pointing it out. >> #!/bin/sh >> >> cat > å.timeline < >> EOF >> >> python <> # encoding: utf-8 >> from xml.sax import parse >> from xml.sax.handler import ContentHandler >> parse(u"å.timeline", ContentHandler()) >> EOF >> >> If I instead do >> >> parse(u"å.timeline".encode("utf-8"), ContentHandler()) >> >> the script runs without errors. >> >> Is this a bug or expected behavior? > > Bug; open() figures out the filesystem encoding just fine. > Bug tracker to report the issue to: http://bugs.python.org/ > > Workaround: > parse(open(u"å.timeline", 'r'), ContentHandler()) When I tried your workaround, I still got this error: Traceback (most recent call last): File "", line 4, in File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/__init__.py", line 31, in parse parser.parse(filename_or_stream) File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse xmlreader.IncrementalParser.parse(self, source) File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/xmlreader.py", line 119, in parse self.prepareParser(source) File "/usr/lib64/python2.7/site-packages/_xmlplus/sax/expatreader.py", line 121, in prepareParser self._parser.SetBase(source.getSystemId()) UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128) The open(..) part works fine, but there still seems to be a problem inside the sax parser. -- Rickard Lindberg -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error in sax parser
Rickard Lindberg, 08.02.2011 16:57: Hi, Here is a bash script to reproduce my error: #!/bin/sh cat> å.timeline< 0.13.0devb38ace0a572b+ 2011-02-01 00:00:00 2011-02-03 08:46:00 asdsd 2011-01-24 16:38:11 2011-02-23 16:38:11 EOF python< Expected behaviour. You cannot parse XML from unicode strings, especially not when the XML data explicitly declares itself as being encoded in UTF-8. Parse from a byte string instead, as you do in your fixed code. Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error in sax parser
On Tue, Feb 8, 2011 at 7:57 AM, Rickard Lindberg wrote: > Hi, > > Here is a bash script to reproduce my error: Including the error message and traceback is still helpful, for future reference. > #!/bin/sh > > cat > å.timeline < > EOF > > python < # encoding: utf-8 > from xml.sax import parse > from xml.sax.handler import ContentHandler > parse(u"å.timeline", ContentHandler()) > EOF > > If I instead do > > parse(u"å.timeline".encode("utf-8"), ContentHandler()) > > the script runs without errors. > > Is this a bug or expected behavior? Bug; open() figures out the filesystem encoding just fine. Bug tracker to report the issue to: http://bugs.python.org/ Workaround: parse(open(u"å.timeline", 'r'), ContentHandler()) Cheers, Chris -- http://mail.python.org/mailman/listinfo/python-list
Unicode error in sax parser
Hi, Here is a bash script to reproduce my error: #!/bin/sh cat > å.timeline < 0.13.0devb38ace0a572b+ 2011-02-01 00:00:00 2011-02-03 08:46:00 asdsd 2011-01-24 16:38:11 2011-02-23 16:38:11 EOF python
Re: Unicode error
In <4c5d4ad9$0$28666$c3e8...@news.astraweb.com> Steven D'Aprano writes: >On Sat, 07 Aug 2010 19:28:56 +1200, Gregory Ewing wrote: >> Steven D'Aprano wrote: >>> "No memory? No disk space? No problem! Just a flesh wound!" What's >>> the point of that? >> >> +1 QOTW >While I'm always happy to be nominated for QOTW, in this case I didn't >say it, and the nomination should go to KJ. (The ol' "insert Monty Python reference" move: it never fails...) -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
Steven D'Aprano wrote: "No memory? No disk space? No problem! Just a flesh wound!" What's the point of that? +1 QOTW -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On Fri, 06 Aug 2010 11:23:50 +, kj wrote: > I don't get your point. Even when I *know* that a certain exception may > happen, I don't necessarily catch it. I catch only those exceptions for > which I can think of a suitable response that is *different* from just > letting the program fail. (After all, my own code raises its own > exceptions with the precise intention of making the program fail.) If > an unexpected exception occurs, then by definition, I had no better > response in mind for that situation than just letting the program fail, > so I'm happy to let that happen. If, afterwards, I think of a different > response for a previously uncaught exception, I'll modify the code > accordingly. > > I find this approach far preferable to the alternative of knowing a long > list of possible exceptions (some of which may never happen in actual > practice), and think of ways to keep the program still alive > no-matter-what. "No memory? No disk space? No problem! Just a flesh > wound!" What's the point of that? /me cheers wildly! Well said! -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
In Nobody writes: >On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote: >> Don't write bare excepts, always catch the error you want and nothing >> else. >That advice would make more sense if it was possible to know which >exceptions could be raised. In practice, that isn't possible, as the >documentation seldom provides this information. Even for the built-in >classes, the documentation is weak in this regard; for less important >modules and third-party libraries, it's entirely absent. I don't get your point. Even when I *know* that a certain exception may happen, I don't necessarily catch it. I catch only those exceptions for which I can think of a suitable response that is *different* from just letting the program fail. (After all, my own code raises its own exceptions with the precise intention of making the program fail.) If an unexpected exception occurs, then by definition, I had no better response in mind for that situation than just letting the program fail, so I'm happy to let that happen. If, afterwards, I think of a different response for a previously uncaught exception, I'll modify the code accordingly. I find this approach far preferable to the alternative of knowing a long list of possible exceptions (some of which may never happen in actual practice), and think of ways to keep the program still alive no-matter-what. "No memory? No disk space? No problem! Just a flesh wound!" What's the point of that? (If I want the final error message to be something other than a bare stack trace, I may wrap the whole execution in a global/top-level try/catch block so that I can fashion a suitable error message right before calling exit, but that's just "softening the fall": the program still will go down.) -- http://mail.python.org/mailman/listinfo/python-list
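The "softening the fall" idea in the last paragraph can be sketched as follows. The names `run` and `broken_main` are made up for illustration, and the example assumes the program's entry point is an ordinary function.

```python
import sys


def run(main):
    """Call main(); on an ordinary error print a short message and return 1."""
    try:
        main()
    except Exception as exc:
        # The program still goes down; only the final message is friendlier.
        sys.stderr.write("fatal: %s: %s\n" % (type(exc).__name__, exc))
        return 1
    return 0


def broken_main():
    # Hypothetical failure standing in for any unexpected exception.
    raise RuntimeError("config file missing")


exit_code = run(broken_main)
# A real script would typically end with: sys.exit(run(main))
```

Catching `Exception` rather than using a bare `except:` here leaves KeyboardInterrupt and SystemExit alone, which matches the advice given elsewhere in the thread.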
Re: Unicode error
Nobody wrote:
>
> Java's checked exception mechanism was based on real-world experience of
> the pitfalls of abstract types. And that experience was gained in
> environments where interface specifications were far more detailed than is
> the norm in the Python world.

There are a number of people who claim that checked exceptions are the wrong answer: http://www.mindview.net/Etc/Discussions/CheckedExceptions

-- Aahz (a...@pythoncraft.com) <*> http://www.pythoncraft.com/ "Normal is what cuts off your sixth finger and your tail..." --Siobhan
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On Sun, 25 Jul 2010 14:47:11 +, Steven D'Aprano wrote: >>> But in the >>> meanwhile, once you get an error, you know what it is. You can >>> intentionally feed code bad data and see what you get. And then maybe >>> add a test to make sure your code traps such errors. >> >> That doesn't really help with exceptions which are triggered by external >> factors rather than explicit inputs. > > Huh? What do you mean by "external factors"? I mean this: > If you mean external factors like "the network goes down" or "the disk is > full", > you can still test for those with appropriate test doubles (think > "stunt doubles", only for testing) such as stubs or mocks. It's a little > bit more work (sometimes a lot more work), but it can be done. I'd say "a lot" is more often the case. >> Also, if you're writing libraries (rather than self-contained programs), >> you have no control over the arguments. > > You can't control what the caller passes to you, but once you have it, > you have total control over it. Total control insofar as you can wrap all method calls in semi-bare excepts (i.e. catch any Exception but not Interrupt). >> Coupled with the fact that duck >> typing is quite widely advocated in Python circles, you're stuck with >> the possibility that any method call on any argument can raise any >> exception. This is even true for calls to standard library functions or >> methods of standard classes if you're passing caller-supplied objects as >> arguments. > > That's a gross exaggeration. It's true that some methods could in theory > raise any exception, but in practice most exceptions are vanishingly > rare. Now *that* is a gross exaggeration. Exceptions are by their nature exceptional, in some sense of the word. But a substantial part of Python development is playing whac-a-mole with exceptions. Write code, run code, get traceback, either fix the cause (LBYL) or handle the exception (EAFP), wash, rinse, repeat. 
> And it isn't even remotely correct that "any" method could raise > anything. If you can get something other than NameError, ValueError or > TypeError by calling "spam".index(arg), I'd like to see it. How common is it to call methods on a string literal in real-world code? It's far, far more common to call methods on an argument or expression whose value could be any "string-like object" (e.g. UserString or a str subclass). IOW, it's "almost" correct that any method can raise any exception. The fact that the number of counter-examples is non-zero doesn't really change this. Even an isinstance() check won't help, as nothing prohibits a subclass from raising exceptions which the original doesn't. Even using "type(x) == sometype" doesn't help if x's methods involve calling methods of user-supplied values (unless those methods are wrapped in catch-all excepts). Java's checked exception mechanism was based on real-world experience of the pitfalls of abstract types. And that experience was gained in environments where interface specifications were far more detailed than is the norm in the Python world. > Frankly, it sounds to me that you're over-analysing all the things that > "could" go wrong rather than focusing on the things that actually do go > wrong. See Murphy's Law. > That's your prerogative, of course, but I don't think you'll get > much support for it here. Alas, I suspect that you're correct. Which is why I don't advocate using Python for "serious" software. Neither the language nor its "culture" are amenable to robustness. -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On Sun, 25 Jul 2010 13:52:33 +0100, Nobody wrote: > On Fri, 23 Jul 2010 18:27:50 -0400, Terry Reedy wrote: > >> But in the >> meanwhile, once you get an error, you know what it is. You can >> intentionally feed code bad data and see what you get. And then maybe >> add a test to make sure your code traps such errors. > > That doesn't really help with exceptions which are triggered by external > factors rather than explicit inputs. Huh? What do you mean by "external factors"? Do you mean like power supply fluctuations, cosmic rays flipping bits in memory, bad hardware? You can't defend against that, not without specialist fault-tolerant hardware, so just don't worry about it. If you mean external factors like "the network goes down" or "the disk is full", you can still test for those with appropriate test doubles (think "stunt doubles", only for testing) such as stubs or mocks. It's a little bit more work (sometimes a lot more work), but it can be done. Or don't worry about it. Release early, release often, and take lots of logs. You'll soon learn what exceptions can happen and what can't. Your software is still useful even when it's not perfect, and there's always time for another bug fix release. > Also, if you're writing libraries (rather than self-contained programs), > you have no control over the arguments. You can't control what the caller passes to you, but once you have it, you have total control over it. You can reject it with an exception, stick it inside a wrapper object, convert it to something else, deal with it as best you can, or just ignore it. > Coupled with the fact that duck > typing is quite widely advocated in Python circles, you're stuck with > the possibility that any method call on any argument can raise any > exception. This is even true for calls to standard library functions or > methods of standard classes if you're passing caller-supplied objects as > arguments. That's a gross exaggeration. 
It's true that some methods could in theory raise any exception, but in practice most exceptions are vanishingly rare. And it isn't even remotely correct that "any" method could raise anything. If you can get something other than NameError, ValueError or TypeError by calling "spam".index(arg), I'd like to see it. Frankly, it sounds to me that you're over-analysing all the things that "could" go wrong rather than focusing on the things that actually do go wrong. That's your prerogative, of course, but I don't think you'll get much support for it here. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
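As a rough illustration of the test doubles mentioned above, the "disk is full" case can be simulated without filling a disk. This is a sketch assuming Python 3 and `unittest.mock`; `save_report` is a hypothetical function invented for the example.

```python
import errno
from unittest import mock


def save_report(path, text):
    """Write text to path; return False if the disk is full (hypothetical function)."""
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False
        raise  # anything else is unexpected, so let it propagate
    return True


# The stub replaces open() and raises the error a full disk would produce.
disk_full = OSError(errno.ENOSPC, "No space left on device")
with mock.patch("builtins.open", side_effect=disk_full):
    outcome = save_report("report.txt", "hello")

print(outcome)
```

Only the one error the code knows how to answer is handled; every other OSError is re-raised, so an unexpected failure still shows up as a traceback.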
Re: Unicode error
On Fri, 23 Jul 2010 18:27:50 -0400, Terry Reedy wrote: > But in the > meanwhile, once you get an error, you know what it is. You can > intentionally feed code bad data and see what you get. And then maybe > add a test to make sure your code traps such errors. That doesn't really help with exceptions which are triggered by external factors rather than explicit inputs. Also, if you're writing libraries (rather than self-contained programs), you have no control over the arguments. Coupled with the fact that duck typing is quite widely advocated in Python circles, you're stuck with the possibility that any method call on any argument can raise any exception. This is even true for calls to standard library functions or methods of standard classes if you're passing caller-supplied objects as arguments. -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
dirknbr writes:
> I have kind of developed this but obviously it's not nice, any better
> ideas?
>
> try:
>     text=texts[i]
>     text=text.encode('latin-1')
>     text=text.encode('utf-8')
> except:
>     text=' '

As Steven has pointed out, if the .encode('latin-1') works, the result is thrown away. This would be very fortunate.

It appears that your goal was to encode the text in latin1 if possible, otherwise in UTF-8, with no indication of which encoding was used. Your second posting confirmed that you were doing this in a loop, ending up with the possibility that your output file would have records with mixed encodings.

Did you consider what a programmer writing code to READ your output file would need to do, e.g. attempt to decode each record as UTF-8 with a fall-back to latin1? Did you consider what would be the result of sending a stream of mixed-encoding text to a display device?

As already advised, the short answer that avoids all of that hassle: just encode in UTF-8.
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On Fri, 23 Jul 2010 22:46:46 +0100, Nobody wrote: > On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote: > >> Don't write bare excepts, always catch the error you want and nothing >> else. > > That advice would make more sense if it was possible to know which > exceptions could be raised. In practice, that isn't possible, as the > documentation seldom provides this information. Even for the built-in > classes, the documentation is weak in this regard; for less important > modules and third-party libraries, it's entirely absent. Aside: that's an awfully sweeping generalisation for all third-party libraries. Yes, the documentation is sometimes weak, but that doesn't stop you from being sensible. Catching any exception, no matter what, whether you've heard of it or seen it before or not, is almost never a good idea. The two problems with bare excepts are: * They mask user generated keyboard interrupts, which is rude. * They hide unexpected errors and disguise them as expected errors. You want unexpected errors to raise an exception as early as possible, because they probably indicate a bug in your code, and the earlier you see the exception, the easier it is to debug. And even if they don't indicate a bug in your code, but merely an under- documented function, it's still better to find out what that is rather than sweep it under the carpet. You will have learned something new ("oh, the httplib functions can raise socket.error as well can they?") which makes you a better programmer, you have the opportunity to improve the documentation, you might want to handle it differently ("should I try again, or just give up now, or reset the flubbler?"). If you decide to just mask the exception, rather than handle it in some other way, it is easy enough to add an extra check to the except clause. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On 7/23/2010 5:46 PM, Nobody wrote: On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote: Don't write bare excepts, always catch the error you want and nothing else. That advice would make more sense if it was possible to know which exceptions could be raised. In practice, that isn't possible, as the documentation seldom provides this information. Even for the built-in classes, the documentation is weak in this regard; for less important modules and third-party libraries, it's entirely absent. I intend to bring that issue up on pydev list sometime. But in the meanwhile, once you get an error, you know what it is. You can intentionally feed code bad data and see what you get. And then maybe add a test to make sure your code traps such errors. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On 07/23/2010 11:46 PM, Nobody wrote: > On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote: > >> Don't write bare excepts, always catch the error you want and nothing >> else. > > That advice would make more sense if it was possible to know which > exceptions could be raised. In practice, that isn't possible, as the > documentation seldom provides this information. Even for the built-in > classes, the documentation is weak in this regard; for less important > modules and third-party libraries, it's entirely absent. > In practice, at least in Python, it tends to be better to work the "other way around": first, write code without exception handlers. Test. If you get an exception, there are really two possible reactions: 1. "WHAT??" => This shouldn't be happening. Rather than catching everything, fix your code, or think it through until you reach conclusion #2 below. 2. "Ah, yes. Of course. I should check for that." => No problem! You're staring at a traceback right now, so you know the exception raised. If you know there should be an exception, but you don't know which one, it should be trivial to create condition in which the exception arises, should it not? Then, you can handle it properly, without resorting to guesswork or over-generalisations. -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On Fri, Jul 23, 2010 at 2:46 PM, Nobody wrote:
> On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote:
>
>> Don't write bare excepts, always catch the error you want and nothing
>> else.
>
> That advice would make more sense if it was possible to know which
> exceptions could be raised. In practice, that isn't possible, as the
> documentation seldom provides this information. Even for the built-in
> classes, the documentation is weak in this regard; for less important
> modules and third-party libraries, it's entirely absent.
>

You still don't want to use bare excepts. People tend to get rather annoyed when you handle KeyboardInterrupts and SystemExits like you would a UnicodeError. Use Exception if you don't know what exceptions can be raised.
-- http://mail.python.org/mailman/listinfo/python-list
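The difference between `except Exception` and a bare `except:` can be checked directly. The sketch below uses made-up helper names (`call_quietly`, `bad_encode`, `interrupt`) and raises KeyboardInterrupt by hand to stand in for a real Ctrl-C.

```python
def call_quietly(func):
    """Run func, swallowing ordinary errors but not interrupts or exits."""
    try:
        return func()
    except Exception as exc:
        return "handled %s" % type(exc).__name__


def bad_encode():
    # A curly quote cannot be encoded as ASCII, so this raises UnicodeEncodeError.
    return u"\u201c".encode("ascii")


def interrupt():
    # Stand-in for the user pressing Ctrl-C.
    raise KeyboardInterrupt


result = call_quietly(bad_encode)

try:
    call_quietly(interrupt)
    interrupted = False
except KeyboardInterrupt:
    # KeyboardInterrupt derives from BaseException, not Exception,
    # so it passes straight through the handler above.
    interrupted = True

print(result, interrupted)
```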
Re: Unicode error
On Fri, 23 Jul 2010 10:42:26 +, Steven D'Aprano wrote: > Don't write bare excepts, always catch the error you want and nothing > else. That advice would make more sense if it was possible to know which exceptions could be raised. In practice, that isn't possible, as the documentation seldom provides this information. Even for the built-in classes, the documentation is weak in this regard; for less important modules and third-party libraries, it's entirely absent. -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On 07/23/2010 12:56 PM, dirknbr wrote: > To give a bit of context. I am using twython which is a wrapper for > the JSON API > > > search=twitter.searchTwitter(s,rpp=100,page=str(it),result_type='recent',lang='en') > for u in search[u'results']: > ids.append(u[u'id']) > texts.append(u[u'text']) > > This is where texts comes from. > > When I then want to write texts to a file I get the unicode error. So your data is unicode? Good. Well, files are just streams of bytes, so to write unicode data to one you have to encode it. Since Python can't know which encoding you want to use (utf-8, by the way, if you ask me), you have to do it manually. something like: outfile.write(text.encode('utf-8')) -- http://mail.python.org/mailman/listinfo/python-list
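An alternative to calling `.encode('utf-8')` on every string is to let the file object do the encoding. The sketch below uses `io.open`, which accepts an `encoding` argument on both Python 2.6+ and Python 3; the sample strings and the temporary file path are invented for the example.

```python
import io
import os
import tempfile

# Hypothetical stand-ins for the tweet texts collected in the loop.
texts = [u"caf\xe9", u"\u201cquoted\u201d"]
path = os.path.join(tempfile.mkdtemp(), "tweets.txt")

# The file object encodes each unicode string as UTF-8 on the way out.
with io.open(path, "w", encoding="utf-8") as outfile:
    for text in texts:
        outfile.write(text + u"\n")

# Reading with the same encoding gives the original unicode strings back.
with io.open(path, "r", encoding="utf-8") as infile:
    lines = infile.read().splitlines()

print(lines == texts)
```

Whichever form is used, the key point from the message stands: pick one encoding for the whole file and state it explicitly.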
Re: Unicode error
To give a bit of context. I am using twython which is a wrapper for the JSON API search=twitter.searchTwitter(s,rpp=100,page=str(it),result_type='recent',lang='en') for u in search[u'results']: ids.append(u[u'id']) texts.append(u[u'text']) This is where texts comes from. When I then want to write texts to a file I get the unicode error. Dirk -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On Fri, Jul 23, 2010 at 3:14 AM, dirknbr wrote: > I am having some problems with unicode from json. > > This is the error I get > > UnicodeEncodeError: 'ascii' codec can't encode character u'\x93' in > position 61: ordinal not in range(128) Please include the full Traceback and the actual code that's causing the error! We aren't mind readers. This error basically indicates that you're incorrectly mixing byte strings and Unicode strings somewhere. Cheers, Chris -- http://blog.rebertia.com -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode error
On Fri, 23 Jul 2010 03:14:11 -0700, dirknbr wrote:
> I am having some problems with unicode from json.
>
> This is the error I get
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\x93' in
> position 61: ordinal not in range(128)
>
> I have kind of developed this but obviously it's not nice, any better
> ideas?
>
> try:
>     text=texts[i]
>     text=text.encode('latin-1')
>     text=text.encode('utf-8')
> except:
>     text=' '

Don't write bare excepts, always catch the error you want and nothing else. As you've written it, the result of encoding with latin-1 is thrown away, even if it succeeds.

    text = texts[i]  # Don't hide errors here.
    try:
        text = text.encode('latin-1')
    except UnicodeEncodeError:
        try:
            text = text.encode('utf-8')
        except UnicodeEncodeError:
            text = ' '
    do_something_with(text)

Another thing you might consider is setting the error handler:

    text = text.encode('utf-8', errors='ignore')

Other error handlers are 'strict' (the default), 'replace' and 'xmlcharrefreplace'.

-- Steven
-- http://mail.python.org/mailman/listinfo/python-list
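To make the error handlers listed above concrete, here is a small comparison. It encodes to ASCII rather than UTF-8 so that the handlers actually have something to do; the sample string is made up.

```python
text = u"caf\xe9"  # hypothetical sample containing one non-ASCII character

# Each handler decides what happens to the character ASCII cannot represent.
handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = dict((name, text.encode("ascii", name)) for name in handlers)

for name in handlers:
    print(name, results[name])

# 'strict' (the default) raises instead of substituting anything.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Note that 'ignore' silently drops data, so it is usually the least attractive choice when the output has to be read back later.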
Unicode error
I am having some problems with unicode from json.

This is the error I get

UnicodeEncodeError: 'ascii' codec can't encode character u'\x93' in position 61: ordinal not in range(128)

I have kind of developed this but obviously it's not nice, any better ideas?

    try:
        text=texts[i]
        text=text.encode('latin-1')
        text=text.encode('utf-8')
    except:
        text=' '

Dirk
-- http://mail.python.org/mailman/listinfo/python-list
Re: Python 2.4 vs 2.5 - Unicode error
On Jan 21, 7:08 pm, John Machin wrote: > > To replace non-ASCII characters in a UTF-8-encoded string by spaces: > | >>> u8 = ' and 25\xc2\xb0F' > | >>> u = u8.decode('utf8') > | >>> ''.join([chr(ord(c)) if c <= u'\x7f' else ' ' for c in u]) > | ' and 25 F' Thanks John for your reply. This is what I needed. Cheers, Gaurav -- http://mail.python.org/mailman/listinfo/python-list
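The quoted snippet is written for Python 2, where `chr(ord(c))` converts each ASCII unicode character back to a byte string. A rough Python 3 equivalent, using the same sample bytes from the thread, could look like this:

```python
# The UTF-8 encoded sample from the thread: ' and 25°F'
u8 = b" and 25\xc2\xb0F"

# Decode first, so the two bytes of the degree sign become one character.
u = u8.decode("utf-8")

# Keep ASCII characters, replace everything else with a single space.
cleaned = "".join(c if c <= "\x7f" else " " for c in u)

print(repr(cleaned))
```

Decoding before filtering matters: filtering the raw bytes instead would replace the two-byte degree sign with two spaces rather than one.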
Re: Python 2.4 vs 2.5 - Unicode error
On Wednesday, 21 January 2009, Gaurav Veda wrote:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 4357: ordinal not in range(128)
>
> Before sending the (insert) query to the mysql server, I do the
> following which I think should've taken care of this problem:
> sqlStr = sqlStr.replace('\\', '')

You might consider using what MySQL offers for Unicode: save all strings encoded as Unicode. It might be more work now, but I think it would be a good investment for the future.

Have a look at the MySQL documentation for:

mysql_real_escape_string(), which takes care of quoted chars
mysql_set_character_set(), for setting the character set used by the database connection

You can ensure that the web page is Unicode by doing something like:

    charsetregex = re.compile(r'charset=(.*?)[\"&]')
    charsetmatch = charsetregex.search(page)
    if charsetmatch:
        charset = charsetmatch.group(1)
        utf8Text = unicode(page, charset)

-- Wolfgang
-- http://mail.python.org/mailman/listinfo/python-list
Re: Python 2.4 vs 2.5 - Unicode error
On Jan 22, 9:50 am, Gaurav Veda wrote:

> > The 0xc2 strongly suggests that you are feeding the beast data encoded
> > in UTF-8 while giving it no reason to believe that it is in fact not
> > encoded in ASCII. Curiously the first errant byte is a long way (4KB)
> > into your data. Consider doing
> > print repr(data)
> > to see what you've actually got there.
>
> >>> sqlStr[4352:4362]
> ' and 25\xc2\xb0F'

That's the UTF-8 version of ' and 25°F' where the character between the 25 and the F is U+00B0 DEGREE SIGN ... interesting stuff to have in an SQL query string.

> All I want to do is to just replace all the non-ascii characters by a
> space.

I can't imagine why you would want to do that to data, let alone to an SQL query. I can't see any evidence that you actually tried to do that, anyway.

To replace non-ASCII characters in a UTF-8-encoded string by spaces:

| >>> u8 = ' and 25\xc2\xb0F'
| >>> u = u8.decode('utf8')
| >>> ''.join([chr(ord(c)) if c <= u'\x7f' else ' ' for c in u])
| ' and 25 F'

> > I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion --
> > different versions of mysql, perhaps?
>
> I am trying to put content into the mysql server running on machine A,
> from machine B & machine C with different versions of python. So I
> don't think this is a mysql issue.

Terminology confusion. Consider the possibility of different versions of MySQLdb (the client interface package) on the client machines B and C. Also consider the possibility that you didn't run exactly the same code on B and C.

> > Show at the very least the full traceback that you get. Try to write a
> > short script that demonstrates the problem with 2.5 and no problem
> > with 2.4, so that (a) it is apparent what you are doing (b) the
> > problem can be reproduced if necessary by someone with access to
> > mysql.

How about a very small script which includes the minimum necessary to run these two lines (with appropriate substitutions for column_x and table_y):

    sql_str = "select column_x from table_y where column_x = '\xc2\xb0'"
    cursor.execute(sql_str)

and run that on B and C?

> Traceback (most recent call last):
>   File "", line 1, in
>   File "putDataIntoDB.py", line 164, in
>     cursor.execute(sqlStr)
>   File "/usr/lib64/python2.5/site-packages/MySQLdb/cursors.py", line
> 146, in execute
>     query = query.encode(charset)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 4359: ordinal not in range(128)
>
> > You might like to explain why you think that doubling backslashes in
> > your SQL is a good idea, and amplify "some processing on the text".
>
> I thought this will achieve 2 things.
> a) It will escape any unicode character (obviously, I was wrong. Got
> carried away by the display. I thought \xc2 will get escaped to \\xc2,
> which is completely preposterous).
> b) It will make sure that the escape sequences in the string (e.g.
> '\n') are received by mysql as an escape sequence.

Run-time programmatic fiddling with an SQL query string is dangerous and tricky at the best of times, worse when you don't inspect the result before you press the launch button.

Cheers,
John
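For readers on Python 3, where byte strings and text are separate types, the same replace-by-spaces recipe can be sketched as follows. The helper name blank_non_ascii is made up for illustration; the sample bytes are the ones quoted in the thread.

```python
def blank_non_ascii(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode the bytes, then replace every non-ASCII character with a space."""
    text = raw.decode(encoding)
    return "".join(c if ord(c) < 128 else " " for c in text)


sample = b" and 25\xc2\xb0F"
print(repr(blank_non_ascii(sample)))
```

Decoding first matters: the two bytes \xc2\xb0 form a single character, so working on the raw bytes would produce two spaces instead of one.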
Re: Python 2.4 vs 2.5 - Unicode error
> The 0xc2 strongly suggests that you are feeding the beast data encoded > in UTF-8 while giving it no reason to believe that it is in fact not > encoded in ASCII. Curiously the first errant byte is a long way (4KB) > into your data. Consider doing > print repr(data) > to see what you've actually got there. >>> sqlStr[4352:4362] ' and 25\xc2\xb0F' All I want to do is to just replace all the non-ascii characters by a space. > I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion -- > different versions of mysql, perhaps? I am trying to put content into the mysql server running on machine A, from machine B & machine C with different versions of python. So I don't think this is a mysql issue. > Show at the very least the full traceback that you get. Try to write a > short script that demonstrates the problem with 2.5 and no problem > with 2.4, so that (a) it is apparent what you are doing (b) the > problem can be reproduced if necessary by someone with access to > mysql. Traceback (most recent call last): File "", line 1, in File "putDataIntoDB.py", line 164, in cursor.execute(sqlStr) File "/usr/lib64/python2.5/site-packages/MySQLdb/cursors.py", line 146, in execute query = query.encode(charset) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 4359: ordinal not in range(128) > You might like to explain why you think that doubling backslashes in > your SQL is a good idea, and amplify "some processing on the text". I thought this will achieve 2 things. a) It will escape any unicode character (obviously, I was wrong. Got carried away by the display. I thought \xc2 will get escaped to \\xc2, which is completely preposterous). b) It will make sure that the escape sequences in the string (e.g. '\n') are received by mysql as an escape sequence. Thanks for your reply! Gaurav > HTH, > John -- http://mail.python.org/mailman/listinfo/python-list
Re: Python 2.4 vs 2.5 - Unicode error
On Jan 22, 4:49 am, Gaurav Veda wrote:

> Hi,
>
> I am trying to put some webpages into a mysql database using python
> (after some processing on the text). If I use Python 2.4.2, it works
> without a fuss. However, on Python 2.5, I get the following error:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
> 4357: ordinal not in range(128)
>
> Before sending the (insert) query to the mysql server, I do the
> following which I think should've taken care of this problem:
> sqlStr = sqlStr.replace('\\', '')
>
> (where sqlStr is the query).
>
> Any suggestions?

The 0xc2 strongly suggests that you are feeding the beast data encoded in UTF-8 while giving it no reason to believe that it is in fact not encoded in ASCII. Curiously the first errant byte is a long way (4KB) into your data. Consider doing

    print repr(data)

to see what you've actually got there.

I'm a little skeptical about the "2.4 works, 2.5 doesn't" notion -- different versions of mysql, perhaps?

Show at the very least the full traceback that you get. Try to write a short script that demonstrates the problem with 2.5 and no problem with 2.4, so that (a) it is apparent what you are doing (b) the problem can be reproduced if necessary by someone with access to mysql.

You might like to explain why you think that doubling backslashes in your SQL is a good idea, and amplify "some processing on the text".

HTH,
John
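The diagnosis above can be reproduced without a database. This Python 3 sketch uses the bytes later quoted in the thread, asks the ASCII codec to decode them, and inspects the exception; the variable names are illustrative only.

```python
# Python 3 sketch: locate the first byte the ASCII codec rejects.
data = b" and 25\xc2\xb0F"

try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    bad_offset = exc.start       # index of the first offending byte
    bad_byte = data[exc.start]   # its value as an integer

# The same bytes decode cleanly once the right codec is named.
decoded = data.decode("utf-8")

print(bad_offset, hex(bad_byte), decoded)
```

The exception's start attribute gives the same kind of position that appears in the traceback message, which is what makes the offending slice easy to find.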
Python 2.4 vs 2.5 - Unicode error
Hi,

I am trying to put some webpages into a mysql database using python (after some processing on the text). If I use Python 2.4.2, it works without a fuss. However, on Python 2.5, I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 4357: ordinal not in range(128)

Before sending the (insert) query to the mysql server, I do the following which I think should've taken care of this problem:

    sqlStr = sqlStr.replace('\\', '')

(where sqlStr is the query).

Any suggestions?

Thanks!
Gaurav
Re: odd unicode error
Martin v. Löwis wrote:

>> path += '/' + b
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1:
>> ordinal not in range(128)
>>
>> Any ideas?
>
> path is a Unicode string, b is a byte string and contains the
> byte \xd0.
>
> The problem is that you have a directory with file names in it that
> cannot be converted to Unicode strings, using the file system
> encoding. If you can't fix the file system, you have to make
> search_path a byte string.
>
> Regards,
> Martin

I fixed it... I didn't tell the whole story. The interface uses wxPython. It returns a unicode pathname that os.walk() uses. I changed that pathname with str() and now it no longer barfs.
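The byte-string workaround discussed above has a direct counterpart in Python 3, where os.walk() returns names of the same type as the path it is given. The sketch below is an illustration under that assumption; the temporary directory and file name are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so the walk has something to report.
    open(os.path.join(tmp, "plain.txt"), "w").close()

    byte_names = []
    # A bytes path in means bytes names out, so no implicit decoding happens.
    for root, dirs, files in os.walk(os.fsencode(tmp)):
        byte_names.extend(files)

    # Convert for display only, using the file system encoding.
    text_names = [os.fsdecode(name) for name in byte_names]

print(byte_names, text_names)
```

Keeping the whole walk in one type avoids the mixed unicode-plus-bytes concatenation that raised the error in the traceback.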
Re: odd unicode error
> path += '/' + b > UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: > ordinal not in range(128) > > Any ideas? path is a Unicode string, b is a byte string and contains the byte \xd0. The problem is that you have a directory with file names in it that cannot be converted to Unicode strings, using the file system encoding. If you can't fix the file system, you have to make search_path a byte string. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
odd unicode error
This:

    for root, dirs, files in os.walk(search_path):
        for f in files:
            print f

###

Produces this:

Traceback (most recent call last):
  File "/home/brad/Desktop/my_script.pyw", line 340, in -toplevel-
    hunt(target_files(search_path, skip_file_extensions(), skip_files()), path_to_results)
  File "/home/brad/Desktop/my_script.pyw", line 161, in target_files
    for root, dirs, files in os.walk(search_path):
  File "os.py", line 291, in walk
    for x in walk(path, topdown, onerror):
  File "os.py", line 291, in walk
    for x in walk(path, topdown, onerror):
  File "os.py", line 281, in walk
    if isdir(join(top, name)):
  File "posixpath.py", line 65, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)

##

I'm running Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02) [GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2

Any ideas? I can't catch this with try/except and using unicode(f) doesn't help either.
Re: Unicode error handler
[EMAIL PROTECTED] wrote:

> On Jan 30, 11:28 pm, Walter Dörwald <[EMAIL PROTECTED]> wrote:
>
>> codecs.register_error("transliterate", transliterate)
>>
>>Walter
>
> Really, really slick solution.
> Though, why was it [:1], not [0]? ;-)

No particular reason, unicodedata.normalize("NFD", ...) should never return an empty string.

> And one more thing:
>> def transliterate(exc):
>>     if not isinstance(exc, UnicodeEncodeError):
>>         raise TypeError("don't know how to handle %r" % r)
> I don't understand what %r and r are and where they are from. The man
> 3 printf page doesn't have %r formatting.

%r means format the repr() result, and r was supposed to be exc. ;)

Servus,
Walter
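With the correction acknowledged above (exc instead of r), the handler can be sketched in Python 3 as follows. This is an adaptation for illustration, not the code as originally posted: it uses str literals and print() as a function, and the test string is written with escapes so it does not depend on the source file encoding.

```python
import codecs
import unicodedata


def transliterate(exc):
    """Encode-error handler: keep only the base character of the bad one."""
    if not isinstance(exc, UnicodeEncodeError):
        # %r formats the repr() of the argument.
        raise TypeError("don't know how to handle %r" % exc)
    # NFD splits an accented letter into base letter + combining marks;
    # [:1] keeps the base letter.
    base = unicodedata.normalize("NFD", exc.object[exc.start])[:1]
    return (base, exc.start + 1)


codecs.register_error("transliterate", transliterate)

result = "Fr\u00e9d\u00e9ric Chopin".encode("ascii", "transliterate")
print(result)
```

If the base character is itself not encodable in the target codec, the encode call still raises, so the handler is only a fit for accented Latin letters.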
Re: Unicode error handler
Martin v. Löwis wrote:

> Walter Dörwald wrote:
>> You might try the following:
>>
>> # -*- coding: iso-8859-1 -*-
>>
>> import unicodedata, codecs
>>
>> def transliterate(exc):
>>     if not isinstance(exc, UnicodeEncodeError):
>>         raise TypeError("don't know how to handle %r" % r)
>>     return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
>> exc.start+1)
>
> I think a number of special cases need to be studied here.
> I would expect that this is "semantically correct" if the characters
> being dropped are combining characters (at least in the languages I'm
> familiar with, it is common to drop them for transliteration).

True, it might make sense to limit the error handler to handling latin characters.

> However, if you do
>
> py> for i in range(65536):
> ...     c = unicodedata.normalize("NFD", unichr(i))
> ...     for c2 in c[1:]:
> ...         if not unicodedata.combining(c2): print hex(i),;break
>
> you'll see that there are many characters which don't decompose
> into a base character + sequence of combining characters. In
> particular, this involves all hangul syllables (U+AC00..U+D7A3),
> for which it is just incorrect to drop the "jungseongs"
> (is that proper wording?).

Of course the above error handler only makes sense when the decomposed codepoints are encodable in the target encoding. For your hangul example neither u"\uac00" nor the decomposed version u"\u1100\u1161" is encodable.

> There are also some cases which I'm completely uncertain about,
> e.g. ORIYA VOWEL SIGN AI decomposes to ORIYA VOWEL SIGN E +
> ORIYA AI LENGTH MARK. Is it correct to drop the length mark?
> It's not listed as a combining character. Likewise,
> MYANMAR LETTER UU decomposes to MYANMAR LETTER U +
> MYANMAR VOWEL SIGN II; same question here.

Servus,
Walter
Re: Unicode error handler
On Wed, 31 Jan 2007 01:21:49 -0300, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

> I don't understand what %r and r are and where they are from. The man
> 3 printf page doesn't have %r formatting.

Perhaps you should look into the Python docs instead?

-- Gabriel Genellina
Re: Unicode error handler
Walter Dörwald wrote:

> You might try the following:
>
> # -*- coding: iso-8859-1 -*-
>
> import unicodedata, codecs
>
> def transliterate(exc):
>     if not isinstance(exc, UnicodeEncodeError):
>         raise TypeError("don't know how to handle %r" % r)
>     return (unicodedata.normalize("NFD", exc.object[exc.start])[:1],
> exc.start+1)

I think a number of special cases need to be studied here. I would expect that this is "semantically correct" if the characters being dropped are combining characters (at least in the languages I'm familiar with, it is common to drop them for transliteration).

However, if you do

    py> for i in range(65536):
    ...     c = unicodedata.normalize("NFD", unichr(i))
    ...     for c2 in c[1:]:
    ...         if not unicodedata.combining(c2): print hex(i),;break

you'll see that there are many characters which don't decompose into a base character + sequence of combining characters. In particular, this involves all hangul syllables (U+AC00..U+D7A3), for which it is just incorrect to drop the "jungseongs" (is that proper wording?).

There are also some cases which I'm completely uncertain about, e.g. ORIYA VOWEL SIGN AI decomposes to ORIYA VOWEL SIGN E + ORIYA AI LENGTH MARK. Is it correct to drop the length mark? It's not listed as a combining character. Likewise, MYANMAR LETTER UU decomposes to MYANMAR LETTER U + MYANMAR VOWEL SIGN II; same question here.

Regards,
Martin
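The scan described above can be rerun on Python 3 roughly as follows. This is a sketch, with chr() in place of unichr() and a helper name (has_non_combining_tail) invented for illustration; the surrogate range is skipped as a precaution, which the original loop did not do.

```python
import unicodedata


def has_non_combining_tail(ch):
    """True if the NFD form has a non-combining character after the first."""
    decomposed = unicodedata.normalize("NFD", ch)
    return any(unicodedata.combining(c) == 0 for c in decomposed[1:])


# Scan the Basic Multilingual Plane, leaving out lone surrogates.
suspects = [
    cp for cp in range(0x10000)
    if not 0xD800 <= cp <= 0xDFFF and has_non_combining_tail(chr(cp))
]

print(len(suspects), hex(suspects[0]))
```

Any code point in suspects is one where keeping only the first decomposed character would discard something other than a combining mark.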
Re: Unicode error handler
On Jan 30, 11:28 pm, Walter Dörwald <[EMAIL PROTECTED]> wrote:

> codecs.register_error("transliterate", transliterate)
>
>Walter

Really, really slick solution. Though, why was it [:1], not [0]? ;-)

And one more thing:

> def transliterate(exc):
>     if not isinstance(exc, UnicodeEncodeError):
>         raise TypeError("don't know how to handle %r" % r)

I don't understand what %r and r are and where they are from. The man 3 printf page doesn't have %r formatting.

Thanks for the tip.
Hieu
Re: Unicode error handler
Rares Vernica wrote:

> Hi,
>
> Does anyone know of any Unicode encode/decode error handler that does a
> better replace job than the default replace error handler?
>
> For example I have an iso-8859-1 string that has an 'e' with an accent
> (you know, the French 'e's). When I use s.encode('ascii', 'replace') the
> 'e' will be replaced with '?'. I would prefer to be replaced with an 'e'
> even if I know it is not 100% correct.
>
> If only this letter would be the problem I would do it manually, but
> there is an entire set of letters that need to be replaced with their
> closest ascii letter.
>
> Is there an encode/decode error handler that can replace all the
> not-ascii letters from iso-8859-1 with their closest ascii letter?

You might try the following:

    # -*- coding: iso-8859-1 -*-

    import unicodedata, codecs

    def transliterate(exc):
        if not isinstance(exc, UnicodeEncodeError):
            raise TypeError("don't know how to handle %r" % r)
        return (unicodedata.normalize("NFD", exc.object[exc.start])[:1], exc.start+1)

    codecs.register_error("transliterate", transliterate)

    print u"Frédéric Chopin".encode("ascii", "transliterate")

Running this script gives you:

    $ python transliterate.py
    Frederic Chopin

Hope that helps.

Servus,
Walter
Re: Unicode error handler
It does the job.

Thanks a lot,
Ray

Peter Otten wrote:

> Rares Vernica wrote:
>
>> Is there an encode/decode error handler that can replace all the
>> not-ascii letters from iso-8859-1 with their closest ascii letter?
>
> A mapping, not an error handler, but it might do the job:
>
> http://effbot.org/zone/unicode-convert.htm
>
> Peter
Re: Unicode error handler
Rares Vernica wrote:

> Is there an encode/decode error handler that can replace all the
> not-ascii letters from iso-8859-1 with their closest ascii letter?

No, but IBM's ICU library can transform one script to another in very flexible and capable ways. One such configuration can do what you ask.

http://www-306.ibm.com/software/globalization/icu/index.jsp
http://icu.sourceforge.net/userguide/Transform.html

Unfortunately, I don't think any of the available ICU bindings for Python have exposed this functionality. If you wanted to contribute such, you might want to start with PyICU. It seems to be the most actively developed of the bindings.

http://pyicu.osafoundation.org/

Of course, that's overkill for this problem. Those transformations can handle such things as this:

Αλφαβητικός Κατάλογος Alphabētikós Katálogos

The number of characters in iso-8859-1 that you would want to transliterate is not all that large. You could spend a little bit of time going through the character set and making a translation map for str.translate().

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
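A hand-built translation map of the kind suggested above could start out like this in Python 3. The table is deliberately tiny and the choices in it (for example mapping the sharp s to "ss") are assumptions for illustration; a real map would cover the rest of iso-8859-1.

```python
# Partial, hand-written map from Latin-1 letters to ASCII approximations.
LATIN1_TO_ASCII = str.maketrans({
    "\u00e0": "a",   # a with grave
    "\u00e7": "c",   # c with cedilla
    "\u00e9": "e",   # e with acute
    "\u00e8": "e",   # e with grave
    "\u00f1": "n",   # n with tilde
    "\u00f6": "o",   # o with diaeresis
    "\u00fc": "u",   # u with diaeresis
    "\u00df": "ss",  # sharp s, expanded to two letters
    "\u00c6": "AE",  # AE ligature
})


def to_ascii(text):
    """Replace the mapped characters; everything else passes through."""
    return text.translate(LATIN1_TO_ASCII)


print(to_ascii("Fr\u00e9d\u00e9ric, Stra\u00dfe"))
```

Unlike an error handler, the map can expand one character into several, at the cost of having to list every character by hand.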
Re: Unicode error handler
Rares Vernica wrote:

> Is there an encode/decode error handler that can replace all the
> not-ascii letters from iso-8859-1 with their closest ascii letter?

A mapping, not an error handler, but it might do the job:

http://effbot.org/zone/unicode-convert.htm

Peter
Unicode error handler
Hi,

Does anyone know of any Unicode encode/decode error handler that does a better replace job than the default replace error handler?

For example I have an iso-8859-1 string that has an 'e' with an accent (you know, the French 'e's). When I use s.encode('ascii', 'replace') the 'e' will be replaced with '?'. I would prefer it to be replaced with an 'e' even if I know it is not 100% correct.

If only this letter were the problem I would do it manually, but there is an entire set of letters that need to be replaced with their closest ascii letter.

Is there an encode/decode error handler that can replace all the non-ascii letters from iso-8859-1 with their closest ascii letter?

Thanks a lot,
Ray
RE: Unicode Error
[Gallagher, Tim (NE)]

| Hey all I am learning Python and having a fun time doing so.
| I have a question for y'all, it has to do with active directory.
| I want to get the last login for a computer from Active
| Directory. I am using the active_directory module and here
| is my code.
|
| [START]
| import active_directory
| computer = active_directory.root()
| for cpu in computer.search ("cn='Computer_Name'"):
|     print cpu.samAccountName   #←--- Works fine
|     print cpu.operatingSystem  #←--- Works fine
|     print cpu.lastLogon        #←--- Getting Error
| [END]
|
| I get an error that I am not sure what to do with, the error
| is TypeError: coercing to Unicode: need string or buffer,
| instance found in my line Do I have to change the output to
| meet Unicode formation?

I started to write an explanation of Unicode and what an encoding was and why you needed it, but then I realised that it wouldn't help - at least not here - because the problem seems to involve converting the value in cpu.lastLogon to Unicode. And I'm not sure why it's even trying to do that.

The lastLogon value (according to the MS docs) is actually a structure in its own right with a HighPart and a LowPart, and you perform various maths on these numbers to give you a real date. In my case (cf code below) if I simply print the lastLogon, I get the anonymous string.

    import active_directory

    me = active_directory.find_computer ()
    print me.samAccountName
    print me.lastLogon # gives >
    print me.lastLogon.HighPart, me.lastLogon.LowPart # gives two long numbers

Short answer, try lastLogon.HighPart & lastLogon.LowPart

TJG
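The "various maths" mentioned above is not spelled out in the thread. As far as I know, Active Directory stores lastLogon as a 64-bit count of 100-nanosecond intervals since 1 January 1601 (UTC), split into the two 32-bit parts. Under that assumption, a Python 3 sketch of the conversion could look like this; the function name is hypothetical and the code does not touch the active_directory module itself.

```python
from datetime import datetime, timedelta, timezone

# Assumed epoch for Windows file-time style values.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def large_integer_to_datetime(high_part, low_part):
    """Combine HighPart/LowPart into a timezone-aware UTC datetime."""
    # LowPart may come back as a signed 32-bit number, so mask it first.
    ticks = (high_part << 32) | (low_part & 0xFFFFFFFF)
    # One tick is 100 ns; ten ticks make a microsecond.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


# Example value: the tick count that corresponds to 1970-01-01 00:00 UTC.
ticks = 116444736000000000
print(large_integer_to_datetime(ticks >> 32, ticks & 0xFFFFFFFF))
```

In the thread's terms, the two arguments would be lastLogon.HighPart and lastLogon.LowPart.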
Unicode Error
Hey all I am learning Python and having a fun time doing so. I have a question for y'all, it has to do with active directory. I want to get the last login for a computer from Active Directory. I am using the active_directory module and here is my code.

[START]
import active_directory
computer = active_directory.root()
for cpu in computer.search ("cn='Computer_Name'"):
    print cpu.samAccountName   ←--- Works fine
    print cpu.operatingSystem  ←--- Works fine
    print cpu.lastLogon        ←--- Getting Error
[END]

I get an error that I am not sure what to do with, the error is TypeError: coercing to Unicode: need string or buffer, instance found in my line Do I have to change the output to meet Unicode formation?

Thanks,
-T
Re: unicode error
[EMAIL PROTECTED] wrote:

> I have this python code:
> print >> htmlFile, " style=\"width: 200px; height:18px;\">";
>
>
> But that causes this error, and I can't figure out why. Any help is
> appreciated
> File "./run.py", line 193, in ?
> print >> htmlFile, " style=\"width: 200px; height:18px;\">";
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9:
> ordinal not in range(128)
>
> Thanks.
>

You can make the code easier to read by using single quotes to quote strings with double quotes inside:

    print >> htmlFile, ('')

Or even better:

    print >> htmlFile, (u'') % unicode(1)

The unicode(1) confuses me -- you are converting an integer to its string representation in unicode (do you know that?), not picking a particular character.

    print >> htmlFile, (u'') % (1,)

And if you don't mean to be writing unicode, you could use:

    print >> htmlFile, ('') % (1,)

--Scott David Daniels
[EMAIL PROTECTED]
Re: unicode error
[EMAIL PROTECTED] wrote:

> I have this python code:
> print >> htmlFile, " style=\"width: 200px; height:18px;\">";
>
>
> But that causes this error, and I can't figure out why. Any help is
> appreciated
> File "./run.py", line 193, in ?
> print >> htmlFile, " style=\"width: 200px; height:18px;\">";
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9:
> ordinal not in range(128)
>
> Thanks.
>

Hi,

I tried and it worked (wrote into the file:). Can you try to isolate exactly what part of the code is wrong?

jm

Here is the complete code:

    htmlfile=file('jmbc.txt','w')
    print >> htmlfile, "";
    htmlfile.close()
unicode error
I have this python code:

    print >> htmlFile, "";

But that causes this error, and I can't figure out why. Any help is appreciated.

    File "./run.py", line 193, in ?
    print >> htmlFile, "";
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 9: ordinal not in range(128)

Thanks.
Re: Unicode error in wx_gdi ?
Erik Bethke wrote:

> Hello All,
>
> I still shaking out my last few bugs in my tile matching game:
>
> I am now down to one stumper for me:
> 1) when I initialize wxPython
> 2) from an exe that I have created with py2exe
> 3) when the executable is located on the desktop as opposed to
> somewhere on C or D directly
> 4) when My Desktop is not written in ascii but instead Korean hangul
>
> I get this error:
>
> Traceback (most recent call last):
>   File "shanghai.py", line 13, in ?
>   File "wxPython\__init__.pyc", line 10, in ?
>   File "wxPython\_wx.pyc", line 3, in ?
>   File "wxPython\_core.pyc", line 15, in ?
>   File "wx\__init__.pyc", line 42, in ?
>   File "wx\_core.pyc", line 10994, in ?
>   File "wx\_gdi.pyc", line 2443, in ?
>   File "wx\_gdi.pyc", line 2340, in Locale_AddCatalogLookupPathPrefix
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xbf in position
> 26: ordinal not in range(128)
>
> Granted this may seem like an obscure error,

Thanks to your explanation, it doesn't look very obscure. I think the code in wxPython uses either sys.path[0] or __file__. Python still keeps byte strings in there for backward compatibility.

> What do i do from here? Do I go into wx_gdi.py and fix it so that it
> uses unicode instead of ascii? I have not yet made any changes to
> other people's libraries...

You should contact the wxPython people for a proper cross-platform fix. Meanwhile you can fix that particular error on Windows by changing sys.path[0] into sys.path[0].decode(sys.getfilesystemencoding()), or do the same thing for __file__.

If there are a lot of similar problems, you can call sys.setdefaultencoding('mbcs') at the start of your program as a last resort. Don't tell anyone I suggested that :) and remember that sys.setdefaultencoding is removed in site.py, and that changing the default encoding can mask encoding bugs and make those bugs hard to trace.

Serge.
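The effect of naming the encoding explicitly can be shown in isolation. This is a Python 3 sketch under stated assumptions: the path is invented, and 'cp949' stands in for whatever sys.getfilesystemencoding() would report on a Korean Windows machine.

```python
# Invented example path with a Korean folder name (written with escapes).
text_path = "C:\\Games\\" + "\ubc14\ud0d5 \ud654\uba74"

# Simulate the byte string an older API would hand back.
raw_path = text_path.encode("cp949")

# Decoding with the ASCII codec fails, as in the traceback above.
try:
    raw_path.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the codec the bytes were written in succeeds.
round_tripped = raw_path.decode("cp949")

print(ascii_ok, round_tripped == text_path)
```

The bytes themselves are fine; the error comes only from decoding them with a codec that was never the one used to produce them.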
Unicode error in wx_gdi ?
Hello All,

I am still shaking out my last few bugs in my tile matching game:

I am now down to one stumper for me:
1) when I initialize wxPython
2) from an exe that I have created with py2exe
3) when the executable is located on the desktop as opposed to somewhere on C or D directly
4) when My Desktop is not written in ascii but instead Korean hangul

I get this error:

Traceback (most recent call last):
  File "shanghai.py", line 13, in ?
  File "wxPython\__init__.pyc", line 10, in ?
  File "wxPython\_wx.pyc", line 3, in ?
  File "wxPython\_core.pyc", line 15, in ?
  File "wx\__init__.pyc", line 42, in ?
  File "wx\_core.pyc", line 10994, in ?
  File "wx\_gdi.pyc", line 2443, in ?
  File "wx\_gdi.pyc", line 2340, in Locale_AddCatalogLookupPathPrefix
UnicodeDecodeError: 'ascii' codec can't decode byte 0xbf in position 26: ordinal not in range(128)

Granted this may seem like an obscure error, but the net effect is that I cannot use wxPython for my games and applications, as many of my users will place the executable directly on their desktop and the path of the desktop contains non-ascii characters.

What do I do from here? Do I go into wx_gdi.py and fix it so that it uses unicode instead of ascii? I have not yet made any changes to other people's libraries...

Any help would be much appreciated,
-Erik