Another (simple) unicode question
Construct http://construct.wikispaces.com/ is a kick-ass binary file structurer (written by a 21 year old!) I thought of trying to port it to python3 but it barfs on some unicode related stuff (after running 2to3) which I am unable to wrap my head around. Can anyone direct me to what I should read to try to understand this? -- http://mail.python.org/mailman/listinfo/python-list
Re: Another (simple) unicode question
On Oct 29, 10:02 pm, Rustom Mody rustompm...@gmail.com wrote: Constructhttp://construct.wikispaces.com/is a kick-ass binary file structurer (written by a 21 year old!) I thought of trying to port it to python3 but it barfs on some unicode related stuff (after running 2to3) which I am unable to wrap my head around. Can anyone direct me to what I should read to try to understand this? unicode related stuff is rather vague. Have you read the Python Unicode HOWTO? Joel Spolsky's article? http://www.amk.ca/python/howto/unicode http://www.joelonsoftware.com/articles/Unicode.html In any case, it's a debugging problem, isn't it? Could you possibly consider telling us the error message, the traceback, a few lines of the 3.x code around where the problem is, and the corresponding 2.x lines? Are you using 3.1.1 and 2.6.4? Does your test work in 2.6? -- http://mail.python.org/mailman/listinfo/python-list
Re: Another (simple) unicode question
On Oct 29, 4:02 am, Rustom Mody rustompm...@gmail.com wrote:
Construct http://construct.wikispaces.com/ is a kick-ass binary file structurer (written by a 21 year old!) I thought of trying to port it to python3 but it barfs on some unicode related stuff (after running 2to3) which I am unable to wrap my head around.

2to3 isn't a general Python 2 to Python 3 translator. You can't pass any old Python 2.x code through 2to3 and expect it to work. Rather, you have to write the Python 2.x code in a subset of Python that I call transitional dialect. In order to port to Python 3 using 2to3, you first have to port it to this transitional dialect.

If Unicode is the issue, one thing you should do is explicitly classify all strings as binary or text in Python 2.x. This means changing str() to unicode() or bytes(), whichever is appropriate, and changing "" literals to u"" or b"".

Carl Banks
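To make the "classify every string" advice concrete, here is a minimal sketch of what it might look like in a binary-packing helper. It is written for a current Python 3 (3.3 or later, where the u"" prefix is accepted again); the names MAGIC, LABEL and pack_record are hypothetical and are not taken from Construct.

```python
import struct

MAGIC = b"CNST"          # binary data: bytes literal (hypothetical file magic)
LABEL = u"temp \u00b0C"  # text: unicode literal

def pack_record(label, value):
    """Pack a text label and an unsigned integer into one binary record."""
    encoded = label.encode("utf-8")   # text becomes bytes only at this boundary
    # ">HI": big-endian, 2-byte label length followed by a 4-byte value
    return MAGIC + struct.pack(">HI", len(encoded), value) + encoded

record = pack_record(LABEL, 21)
print(type(record).__name__, len(record))
```

Once every literal carries an explicit b or u, mixing the two kinds by accident shows up as a TypeError at the point of the mistake rather than as corrupted output later.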
Re: Another (simple) unicode question
John Machin wrote:
On Oct 29, 10:02 pm, Rustom Mody rustompm...@gmail.com wrote:
... I thought of trying to port it to python3 but it barfs on some unicode related stuff (after running 2to3) which I am unable to wrap my head around. Can anyone direct me to what I should read to try to understand this?

to which John replied with some good links to start, and then:

In any case, it's a debugging problem, isn't it? Could you possibly consider telling us the error message, the traceback, a few lines of the 3.x code around where the problem is, and the corresponding 2.x lines? Are you using 3.1.1 and 2.6.4? Does your test work in 2.6?

Also consider how 2to3 translates the problem section(s).

--Scott David Daniels scott.dani...@acm.org
Re: a simple unicode question
On Wed, 28 Oct 2009 02:28:01 -0300, Chris Jones cjns1...@gmail.com wrote:
On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote:
Chris Jones wrote:
Best part of Unicode is that there are multiple encodings, right? ;-)

No, the best part about Unicode is there is no encoding! Unicode does not define any encoding;

RFC 3629: "ISO/IEC 10646 and Unicode define several encoding forms of their common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32."

what it defines is code-points for characters which is not related to how characters are encoded in files or network transmission.

In other words, Unicode is not related to any encoding .. and yet the UTF-8, UTF-16.. encoding forms are clearly related to Unicode. How is that possible?

Start reading "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)", by Joel Spolsky. http://www.joelonsoftware.com/articles/Unicode.html

-- Gabriel Genellina
Re: a simple unicode question
Chris Jones cjns1...@gmail.com wrote in message news:mailman.2149.1256707687.2807.python-l...@python.org... On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote: Chris Jones wrote: [..] Best part of Unicode is that there are multiple encodings, right? ;-) No, the best part about Unicode is there is no encoding! Unicode does not define any encoding; RFC 3629: ISO/IEC 10646 and Unicode define several encoding forms of their common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. what it defines is code-points for characters which is not related to how characters are encoded in files or network transmission. In other words, Unicode is not related to any encoding .. and yet the UTF-8, UTF-16.. encoding forms are clearly related to Unicode. How is that possible? CJ When I first saw it, my first thought was that the subjectline was an oxymoron. --Tim Arnold -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote: Chris Jones wrote: [..] Best part of Unicode is that there are multiple encodings, right? ;-) No, the best part about Unicode is there is no encoding! Unicode does not define any encoding; RFC 3629: ISO/IEC 10646 and Unicode define several encoding forms of their common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. what it defines is code-points for characters which is not related to how characters are encoded in files or network transmission. In other words, Unicode is not related to any encoding .. and yet the UTF-8, UTF-16.. encoding forms are clearly related to Unicode. How is that possible? CJ -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
Chris Jones wrote: On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote: [..] Characters outside the 16-bit range aren't supported on all builds. They won't be supported on most Windows builds, as Windows uses 16-bit Unicode extensively: I knew nothing about UTF-16 friends before this thread. Best part of Unicode is that there are multiple encodings, right? ;-) No, the best part about Unicode is there is no encoding! Unicode does not define any encoding; what it defines is code-points for characters which is not related to how characters are encoded in files or network transmission. -- http://mail.python.org/mailman/listinfo/python-list
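The distinction between a code point and its encoded forms can be seen directly. The following is a small sketch in Python 3 (the thread itself uses Python 2); the variable names are only illustrative. One character, DEGREE SIGN, has exactly one code point but a different byte sequence under each encoding.

```python
degree = "\N{DEGREE SIGN}"       # a single code point, U+00B0
print(hex(ord(degree)))          # the code point, independent of any encoding

codecs = ("utf-8", "utf-16-be", "utf-32-be", "iso-8859-1")
encodings = {name: degree.encode(name) for name in codecs}
for name, data in encodings.items():
    # Same character, different bytes on the wire for each encoding
    print(name, data.hex())
```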
Re: a simple unicode question
On Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com wrote:
On Oct 21, 4:59 am, Bruno Desthuilliers bruno. 42.desthuilli...@websiteburo.invalid wrote:
beSTEfar wrote: (snip)
When parsing strings, use Regular Expressions.

And now you have _two_ problems <g> For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser.

But keep in mind that many XML, HTML, etc parsing problems are restricted to a subset where you know the nesting depth is limited (often to 0 or 1), and for that large set of problems, RE's *are* enough.

I don't think so. Nesting isn't the only problem. RE's cannot handle comments, for example. And you must support unquoted attributes, single and double quotes, any attribute ordering, empty tags, arbitrary whitespace... If you don't, you are not reading XML (or HTML), only a specific file format that resembles XML but actually isn't.

-- Gabriel Genellina
Re: a simple unicode question
On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote: [..]
Characters outside the 16-bit range aren't supported on all builds. They won't be supported on most Windows builds, as Windows uses 16-bit Unicode extensively:

I knew nothing about UTF-16 friends before this thread. Best part of Unicode is that there are multiple encodings, right? ;-) Moot point on xterm anyway, since you'd be hard put to it to find a decent terminal font that covers anything outside the BMP.

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Note that narrow builds do understand names outside of the BMP, and generate surrogate pairs for them:

>>> u'\N{LINEAR B SYLLABLE B008 A}'
u'\U00010000'
>>> len(_)
2

Whether or not using surrogates in this context is a good idea is open to debate. What's the advantage of a multi-wchar string over a multi-byte string?

I don't understand this last remark, but since I'm only a GNU/Linux hobbyist, I guess it doesn't make much difference. Thanks for the code snippet and comments.

CJ
Re: a simple unicode question
On 10/22/2009 03:23 AM, Gabriel Genellina wrote:
On Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com wrote:
On Oct 21, 4:59 am, Bruno Desthuilliers bruno. 42.desthuilli...@websiteburo.invalid wrote:
beSTEfar wrote: (snip)
When parsing strings, use Regular Expressions.

And now you have _two_ problems <g> For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser.

But keep in mind that many XML, HTML, etc parsing problems are restricted to a subset where you know the nesting depth is limited (often to 0 or 1), and for that large set of problems, RE's *are* enough.

I don't think so. Nesting isn't the only problem. RE's cannot handle comments, for example. And you must support unquoted attributes, single and double quotes, any attribute ordering, empty tags, arbitrary whitespace... If you don't, you are not reading XML (or HTML), only a specific file format that resembles XML but actually isn't.

OK, then let me rephrase my point as: in the real world it is often not necessary to parse XML in its full generality; parsing, as you put it, a specific file format that resembles XML is all that is really needed.
Re: a simple unicode question
On Thu, 22 Oct 2009 17:08:21 -0300, ru...@yahoo.com wrote:
On 10/22/2009 03:23 AM, Gabriel Genellina wrote:
On Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com wrote:
On Oct 21, 4:59 am, Bruno Desthuilliers bruno. 42.desthuilli...@websiteburo.invalid wrote:
beSTEfar wrote: (snip)
When parsing strings, use Regular Expressions.

And now you have _two_ problems <g> For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser.

But keep in mind that many XML, HTML, etc parsing problems are restricted to a subset where you know the nesting depth is limited (often to 0 or 1), and for that large set of problems, RE's *are* enough.

I don't think so. Nesting isn't the only problem. RE's cannot handle comments, for example. And you must support unquoted attributes, single and double quotes, any attribute ordering, empty tags, arbitrary whitespace... If you don't, you are not reading XML (or HTML), only a specific file format that resembles XML but actually isn't.

OK, then let me rephrase my point as: in the real world it is often not necessary to parse XML in its full generality; parsing, as you put it, a specific file format that resembles XML is all that is really needed.

Given that using a real XML parser like ElementTree is as easy as (or even easier than) building a regular expression, and more robust, and more likely to survive small changes in the input format, why use the worse solution? RE's are good in solving some problems, but parsing XML isn't one of those.

-- Gabriel Genellina
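As a rough illustration of that point, here is a short ElementTree sketch in Python 3. The XML document and its element and attribute names (points, point, lat, lon) are made up for the example; it deliberately mixes a comment, both quote styles, varying attribute order, an empty tag and stray whitespace, all of which the parser handles without special cases.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: every variation below would need its own regex branch
doc = """<points>
  <!-- a comment that a naive regex could trip over -->
  <point lat='48.22' lon="16.37"/>
  <point lon="2.35"   lat="48.86" ></point>
</points>"""

root = ET.fromstring(doc)
# Attribute order, quoting style and empty-vs-closed tags make no difference here
coords = [(float(p.get("lat")), float(p.get("lon"))) for p in root.iter("point")]
print(coords)
```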
Re: a simple unicode question
George Trojan george.tro...@noaa.gov wrote in message news:hbktk6$8b...@news.nems.noaa.gov... Thanks for all suggestions. It took me a while to find out how to configure my keyboard to be able to type the degree sign. I prefer to stick with pure ASCII if possible. Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look? George Scott David Daniels wrote: Mark Tolonen wrote: Is there a better way of getting the degrees? It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you type non-ASCII characters in source code, make sure to declare the encoding the file is *actually* saved in: # coding: utf-8 s = '''48° 13' 16.80 N''' q = s.decode('utf-8') # next line equivalent to previous two q = u'''48° 13' 16.80 N''' # couple ways to find the degrees print int(q[:q.find(u'°')]) import re print re.search(ur'(\d+)°',q).group(1) Mark is right about the source, but you needn't write unicode source to process unicode data. Since nobody else mentioned my favorite way of writing unicode in ASCII, try: IDLE 2.6.3 s = '''48\xc2\xb0 13' 16.80 N''' q = s.decode('utf-8') degrees, rest = q.split(u'\N{DEGREE SIGN}') print degrees 48 print rest 13' 16.80 N And if you are unsure of the name to use: import unicodedata unicodedata.name(u'\xb0') 'DEGREE SIGN' It wouldn't be your favorite way if you were typing Chinese: x = u'我是美国人。' vs. x = u'\N{CJK UNIFIED IDEOGRAPH-6211}\N{CJK UNIFIED IDEOGRAPH-662F}\N{CJK UNIFIED IDEOGRAPH-7F8E}\N{CJK UNIFIED IDEOGRAPH-56FD}\N{CJK UNIFIED IDEOGRAPH-4EBA}\N{IDEOGRAPHIC FULL STOP}' ;^) Mark -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
George Trojan wrote:
Scott David Daniels wrote:
... And if you are unsure of the name to use:

>>> import unicodedata
>>> unicodedata.name(u'\xb0')
'DEGREE SIGN'

Thanks for all suggestions. It took me a while to find out how to configure my keyboard to be able to type the degree sign. I prefer to stick with pure ASCII if possible. Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look?

I thought the mention of unicodedata would make it clear.

    import sys
    import unicodedata

    # Python 2: scan every code point and print those whose name matches
    for n in xrange(sys.maxunicode+1):
        try:
            nm = unicodedata.name(unichr(n))
        except ValueError:
            pass    # unassigned code points have no name
        else:
            if 'tortoise' in nm.lower():
                print n, nm

--Scott David Daniels scott.dani...@acm.org
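For readers on Python 3, the same name search might be sketched as follows. The function name find_names is hypothetical; it uses the default argument of unicodedata.name instead of catching ValueError, which is only a stylistic choice.

```python
import sys
import unicodedata

def find_names(fragment):
    """Return (code point, name) pairs whose Unicode name contains fragment."""
    fragment = fragment.upper()          # Unicode character names are upper case
    hits = []
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), None)   # None for unnamed code points
        if name is not None and fragment in name:
            hits.append((cp, name))
    return hits

matches = find_names("degree sign")
for cp, name in matches:
    print("U+%04X" % cp, name)
```

The exact list of hits depends on the Unicode database version bundled with the interpreter.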
Re: a simple unicode question
On Wed, Oct 21, 2009 at 12:20:35AM EDT, Nobody wrote:
On Tue, 20 Oct 2009 17:56:21 +, George Trojan wrote: [..]
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined?

You can get them from the unicodedata module, e.g.:

    import unicodedata
    for i in xrange(0x10000):
        n = unicodedata.name(unichr(i),None)
        if n is not None:
            print i, n

Python rocks! Just curious, why did you choose to set the upper boundary at 0x10000?

CJ
Re: a simple unicode question
beSTEfar wrote: (snip)
When parsing strings, use Regular Expressions.

And now you have _two_ problems <g> For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser.
Re: a simple unicode question
On Wed, 21 Oct 2009 05:16:56 -0400, Chris Jones wrote:
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined?

You can get them from the unicodedata module, e.g.:

    import unicodedata
    for i in xrange(0x10000):
        n = unicodedata.name(unichr(i),None)
        if n is not None:
            print i, n

Python rocks! Just curious, why did you choose to set the upper boundary at 0x10000?

Characters outside the 16-bit range aren't supported on all builds. They won't be supported on most Windows builds, as Windows uses 16-bit Unicode extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Note that narrow builds do understand names outside of the BMP, and generate surrogate pairs for them:

>>> u'\N{LINEAR B SYLLABLE B008 A}'
u'\U00010000'
>>> len(_)
2

Whether or not using surrogates in this context is a good idea is open to debate. What's the advantage of a multi-wchar string over a multi-byte string?
Re: a simple unicode question
On Oct 21, 4:59 am, Bruno Desthuilliers bruno. 42.desthuilli...@websiteburo.invalid wrote:
beSTEfar wrote: (snip)
When parsing strings, use Regular Expressions.

And now you have _two_ problems <g> For some simple parsing problems, Python's string methods are powerful enough to make REs overkill. And for any complex enough parsing (any recursive construct for example - think XML, HTML, any programming language etc), REs are just NOT enough by themselves - you need a full blown parser.

But keep in mind that many XML, HTML, etc parsing problems are restricted to a subset where you know the nesting depth is limited (often to 0 or 1), and for that large set of problems, RE's *are* enough.
Re: a simple unicode question
Nobody wrote:
Just curious, why did you choose to set the upper boundary at 0x10000?

Characters outside the 16-bit range aren't supported on all builds. They won't be supported on most Windows builds, as Windows uses 16-bit Unicode extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

In Python 3, if not 2.6, chr(0x10000) (what used to be unichr()) works fine on Windows, and generates the appropriate surrogate pair.
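The surrogate pair mentioned here can still be inspected on a current interpreter. Since Python 3.3 there are no narrow builds any more, so the string itself has length 1; the pair only appears when the character is encoded as UTF-16. A small sketch (variable names are illustrative):

```python
ch = "\N{LINEAR B SYLLABLE B008 A}"      # U+10000, the first code point beyond the BMP
utf16 = ch.encode("utf-16-be")           # four bytes: two 16-bit code units

# Split the encoded bytes back into their 16-bit code units
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print(len(ch), hex(ord(ch)), [hex(u) for u in units])
```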
Re: a simple unicode question
Mark Tolonen wrote:
Is there a better way of getting the degrees?

It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you type non-ASCII characters in source code, make sure to declare the encoding the file is *actually* saved in:

    # coding: utf-8
    s = '''48° 13' 16.80 N'''
    q = s.decode('utf-8')
    # next line equivalent to previous two
    q = u'''48° 13' 16.80 N'''
    # couple ways to find the degrees
    print int(q[:q.find(u'°')])
    import re
    print re.search(ur'(\d+)°',q).group(1)

Mark is right about the source, but you needn't write unicode source to process unicode data. Since nobody else mentioned my favorite way of writing unicode in ASCII, try:

IDLE 2.6.3
>>> s = '''48\xc2\xb0 13' 16.80 N'''
>>> q = s.decode('utf-8')
>>> degrees, rest = q.split(u'\N{DEGREE SIGN}')
>>> print degrees
48
>>> print rest
 13' 16.80 N

And if you are unsure of the name to use:

>>> import unicodedata
>>> unicodedata.name(u'\xb0')
'DEGREE SIGN'

--Scott David Daniels scott.dani...@acm.org
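The session above is Python 2. A Python 3 rendering of the same idea, offered as a sketch, starts from a bytes object (where Python 2 had a plain str) and keeps the source file pure ASCII by using the \N{...} escape:

```python
raw = b"48\xc2\xb0 13' 16.80 N"          # UTF-8 bytes as they might arrive from a file
text = raw.decode("utf-8")               # bytes -> text

# Split on the named character; no non-ASCII needed in the source
degrees, rest = text.split("\N{DEGREE SIGN}")
print(int(degrees))
print(rest)
```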
Re: a simple unicode question
Thanks for all suggestions. It took me a while to find out how to configure my keyboard to be able to type the degree sign. I prefer to stick with pure ASCII if possible. Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look? George Scott David Daniels wrote: Mark Tolonen wrote: Is there a better way of getting the degrees? It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you type non-ASCII characters in source code, make sure to declare the encoding the file is *actually* saved in: # coding: utf-8 s = '''48° 13' 16.80 N''' q = s.decode('utf-8') # next line equivalent to previous two q = u'''48° 13' 16.80 N''' # couple ways to find the degrees print int(q[:q.find(u'°')]) import re print re.search(ur'(\d+)°',q).group(1) Mark is right about the source, but you needn't write unicode source to process unicode data. Since nobody else mentioned my favorite way of writing unicode in ASCII, try: IDLE 2.6.3 s = '''48\xc2\xb0 13' 16.80 N''' q = s.decode('utf-8') degrees, rest = q.split(u'\N{DEGREE SIGN}') print degrees 48 print rest 13' 16.80 N And if you are unsure of the name to use: import unicodedata unicodedata.name(u'\xb0') 'DEGREE SIGN' --Scott David Daniels scott.dani...@acm.org -- http://mail.python.org/mailman/listinfo/python-list
Re: a simple unicode question
On Tue, 20 Oct 2009 17:56:21 +, George Trojan wrote:
Thanks for all suggestions. It took me a while to find out how to configure my keyboard to be able to type the degree sign. I prefer to stick with pure ASCII if possible. Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look?

You can get them from the unicodedata module, e.g.:

    import unicodedata
    for i in xrange(0x10000):
        n = unicodedata.name(unichr(i),None)
        if n is not None:
            print i, n
Re: a simple unicode question
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt Is that the place to look? Correct - you are supposed to fill in a Unicode character name into the \N escape. The specific list of names depends on the version of the UCD which was used in the specific Python version, but the characters you are likely interested in probably had been defined forever. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
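To check which version of the Unicode Character Database a given interpreter was built with, and to resolve a name without writing a \N escape, the unicodedata module can be asked directly. A brief Python 3 sketch:

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter
print(unicodedata.unidata_version)

# lookup() is the runtime counterpart of the \N{...} escape
degree = unicodedata.lookup("DEGREE SIGN")
print(repr(degree), unicodedata.name(degree))
```

lookup() raises KeyError for a name the bundled database does not know, which is how a version mismatch would show up in practice.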
a simple unicode question
A trivial one, this is the first time I have to deal with Unicode. I am trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is iso-8859-1. To get the degrees I did

>>> encoding='iso-8859-1'
>>> q=s.decode(encoding)
>>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
>>> r=q.split()[0]
>>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

George
Re: a simple unicode question
George Trojan wrote:
A trivial one, this is the first time I have to deal with Unicode. I am trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is iso-8859-1. To get the degrees I did

>>> encoding='iso-8859-1'
>>> q=s.decode(encoding)
>>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
>>> r=q.split()[0]
>>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

Instead of this rather convoluted way to specify a degree-sign, better do

    # -*- coding: utf-8 -*-
    ...
    int(r[:r.find(u"°")])

Please note that the utf-8-encoding has *nothing* to do with your string - it's just the source-file encoding. Of course your editor must use utf-8 for saving the file. Or you can use any other one you like.

Diez
Re: a simple unicode question
On 19 Oct, 21:07, George Trojan george.tro...@noaa.gov wrote:
A trivial one, this is the first time I have to deal with Unicode. I am trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is iso-8859-1. To get the degrees I did

>>> encoding='iso-8859-1'
>>> q=s.decode(encoding)
>>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
>>> r=q.split()[0]
>>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees? George

When parsing strings, use Regular Expressions. If you don't know how to, spend some time teaching yourself how to - well spent time! A great tool for playing around with REs is KODOS. For the problem at hand you can e.g.:

    import re
    degrees = int(re.findall(r'\d+', s)[0])

That in essence collects every run of consecutive digits, takes the first run, and int()s it. No need to care/know about the fact that the string is Unicode and the underlying coding of the charset.
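A slightly fuller version of the regular-expression approach, as a Python 3 sketch. The pattern below is an assumption about the input shape (integer degrees and minutes, possibly fractional seconds); it is not taken from the thread.

```python
import re

s = "48\N{DEGREE SIGN} 13' 16.80 N"

# Leading digits only: the degrees
degrees = int(re.match(r"\s*(\d+)", s).group(1))

# Every number in the string, allowing an optional fractional part
fields = re.findall(r"\d+(?:\.\d+)?", s)
print(degrees, fields)
```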
Re: a simple unicode question
George Trojan george.tro...@noaa.gov wrote in message news:hbidd7$i9...@news.nems.noaa.gov...
A trivial one, this is the first time I have to deal with Unicode. I am trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is iso-8859-1. To get the degrees I did

>>> encoding='iso-8859-1'
>>> q=s.decode(encoding)
>>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
>>> r=q.split()[0]
>>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

It seems your string is UTF-8. \xc2\xb0 is UTF-8 for DEGREE SIGN. If you type non-ASCII characters in source code, make sure to declare the encoding the file is *actually* saved in:

    # coding: utf-8
    s = '''48° 13' 16.80 N'''
    q = s.decode('utf-8')
    # next line equivalent to previous two
    q = u'''48° 13' 16.80 N'''
    # couple ways to find the degrees
    print int(q[:q.find(u'°')])
    import re
    print re.search(ur'(\d+)°',q).group(1)

-Mark
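The u'48\xc2\xb0' in George's output is the tell-tale sign of UTF-8 bytes decoded as ISO-8859-1: the two bytes of the degree sign each become a separate character. A short Python 3 sketch reproduces the effect (variable names are illustrative):

```python
raw = "48\N{DEGREE SIGN}".encode("utf-8")   # the bytes actually in the string

wrong = raw.decode("iso-8859-1")   # each byte mapped to one character
right = raw.decode("utf-8")        # the two bytes \xc2\xb0 read as one character

print(repr(wrong), len(wrong))
print(repr(right), len(right))
```

ISO-8859-1 can decode any byte sequence without raising an error, which is why the wrong guess went unnoticed.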
Re: (Simple?) Unicode Question
On Sun, 30 Aug 2009 02:36:49 +, Steven D'Aprano wrote: So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create. UTF-8 isn't a particularly sensible encoding for terminals. Did I mention UTF-8? Out of curiosity, why do you say that UTF-8 isn't sensible for terminals? I don't think I've ever seen a terminal (whether an emulator running on a PC or a hardware terminal) which supports anything like the entire Unicode repertoire, along with right-to-left writing, complex scripts, etc. Even support for double-width characters is uncommon. If your terminal can't handle anything outside of ISO-8859-1, there isn't any advantage to using UTF-8, and some disadvantages; e.g. a typical Unix tty driver will delete the last *byte* from the input buffer when you press backspace (Linux 2.6.* has the IUTF8 flag, but this is non-standard). Historically, terminal I/O has tended to revolve around unibyte encodings, with everything except the endpoints being encoding-agnostic. Anything which falls outside of that is a dog's breakfast; it's no coincidence that the word for messed-up text (arising from an encoding mismatch) was borrowed from Japanese (mojibake). Life is simpler if you can use a unibyte encoding. Apart from anything else, the failure modes tend to be harmless. E.g. you get the wrong glyph rather than two glyphs where you expected one. On a 7-bit channel, you get the wrong printable character rather than a control character (this is why ISO-8859-* reserves \x80-\x9F as control codes rather than using them as printable characters). And Unicode font is an oxymoron. You can merge a whole bunch of fonts together and stuff them into a TTF file; that doesn't make them a font, though. I never mentioned Unicode font either. In any case, there's no reason why a skillful designer can't make a single font which covers the entire Unicode range in a consistent style. 
Consistency between unrelated scripts is neither realistic nor desirable. E.g. Latin fonts tend to use uniform stroke widths unless they're specifically designed to look like handwriting, whereas Han fonts tend to prefer variable-width strokes which reflect the direction. The main advantage of using Unicode internally is that you can associate encodings with the specific points where data needs to be converted to/from bytes, rather than having to carry the encoding details around the program. Surely the main advantage of Unicode is that it gives you a full and consistent range of characters not limited to the 128 characters provided by ASCII? Nothing stops you from using other encodings, or from using multiple encodings. But using multiple encodings means keeping track of the encodings. This isn't impossible, and it may produce better results (e.g. no information loss from Han unification), but it can be a lot more work. -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
* Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700) Further, does anything, except a printing device need to know the encoding of a piece of text? Python needs to know if you are processing the text. I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding. Thorsten -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Sat, 29 Aug 2009 09:34:43 +0200, Thorsten Kampe wrote:
* Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700)
Further, does anything, except a printing device need to know the encoding of a piece of text?

Python needs to know if you are processing the text.

Python only needs to know when you convert the text to or from bytes. I can do this:

>>> s = "hello"
>>> t = "world"
>>> print(' '.join([s, t]))
hello world

and not need to care anything about encodings. So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create.

I may be wrong, but I believe that's part of the idea behind separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)

Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding.

You only need to worry about encoding when you convert from bytes to text, and vice versa. Admittedly, the most common time you need to do that is when reading input from files, but if all your text strings are generated by Python, and not output anywhere, you shouldn't need to care about encodings. If all your text contains nothing but ASCII characters, you should never need to worry about encodings at all.

-- Steven
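The "encodings matter only at the boundaries" idea can be sketched in Python 3 as follows. The helpers read_text and write_text are hypothetical, and io.BytesIO merely stands in for a real file or socket.

```python
import io

def read_text(stream, encoding):
    """Input boundary: bytes come in, text goes to the program."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding):
    """Output boundary: text leaves the program as bytes."""
    stream.write(text.encode(encoding))

source = io.BytesIO("48\N{DEGREE SIGN} N".encode("iso-8859-1"))
text = read_text(source, "iso-8859-1")

# Pure text processing: no encoding is mentioned anywhere in this step
joined = " ".join([text, "hello", "world"])

sink = io.BytesIO()
write_text(sink, joined, "utf-8")
print(sink.getvalue())
```

Input and output use different encodings here on purpose; the code in between neither knows nor cares.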
Re: (Simple?) Unicode Question
On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote:
Python only needs to know when you convert the text to or from bytes. I can do this:

>>> s = "hello"
>>> t = "world"
>>> print(' '.join([s, t]))
hello world

and not need to care anything about encodings. So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create.

UTF-8 isn't a particularly sensible encoding for terminals. And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts together and stuff them into a TTF file; that doesn't make them a font, though.

I may be wrong, but I believe that's part of the idea behind separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)

Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding.

You only need to worry about encoding when you convert from bytes to text, and vice versa. Admittedly, the most common time you need to do that is when reading input from files, but if all your text strings are generated by Python, and not output anywhere, you shouldn't need to care about encodings.

Why would you generate text strings and not output them anywhere? The main advantage of using Unicode internally is that you can associate encodings with the specific points where data needs to be converted to/from bytes, rather than having to carry the encoding details around the program.
Re: (Simple?) Unicode Question
On Sat, 29 Aug 2009 20:09:12 +0100, Nobody wrote:

> On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote:
>
>> Python only needs to know when you convert the text to or from bytes. I can do this:
>>
>>     >>> s = "hello"
>>     >>> t = "world"
>>     >>> print(' '.join([s, t]))
>>     hello world
>>
>> and not need to care anything about encodings.
>>
>> So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create.
>
> UTF-8 isn't a particularly sensible encoding for terminals.

Did I mention UTF-8?

Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?

> And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts together and stuff them into a TTF file; that doesn't make them a font, though.

I never mentioned "Unicode font" either. In any case, there's no reason why a skillful designer can't make a single font which covers the entire Unicode range in a consistent style.

>>>> I may be wrong, but I believe that's part of the idea behind the separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)
>>>
>>> Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding.
>>
>> You only need to worry about encoding when you convert from bytes to text, and vice versa. Admittedly, the most common time you need to do that is when reading input from files, but if all your text strings are generated by Python, and not output anywhere, you shouldn't need to care about encodings.
>
> Why would you generate text strings and not output them anywhere?

Who knows? It doesn't matter -- the point is that you can if you want to. You only need to worry about encodings at input and output, therefore logically if you don't do I/O you can process strings all day long and never worry about encodings at all.

> The main advantage of using Unicode internally is that you can associate encodings with the specific points where data needs to be converted to/from bytes, rather than having to carry the encoding details around the program.

Surely the main advantage of Unicode is that it gives you a full and consistent range of characters not limited to the 128 characters provided by ASCII?

--
Steven
Re: (Simple?) Unicode Question
> Further, does anything except a printing device need to know the encoding of a piece of text?

I may be wrong, but I believe that's part of the idea behind the separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)

If you're using Python 2.x, though, I believe if you simply set the file opening mode to binary then data you read() should still be treated as an array of bytes, although you may encounter issues trying to access the n'th character.

Please do correct me if I'm wrong, anyone.

On Thu, 27 Aug 2009 09:39:06 -0700, Shashank Singh shashank.sunny.si...@gmail.com wrote:

> Hi All!
>
> I have a very simple (and probably stupid) question eluding me. When exactly is the char-set information needed?
>
> To make my question clear, consider reading a file. While reading a file, all I get is basically an array of bytes.
>
> Now suppose a file has 10 bytes in it (all is data, no metadata, forget the BOM and stuff for a little while). I read it into an array of 10 bytes, replace, say, the 2nd byte and write all the bytes back to a new file. Do I need the character encoding mumbo jumbo anywhere in this?
>
> Further, does anything except a printing device need to know the encoding of a piece of text?
>
> I mean, as long as we are not trying to get a symbolic representation of a text or get the ith character of it, all we need to do is to carry the intended encoding as auxiliary information to the data stored as a byte array.
>
> Right?
>
> --shashank

--
Rami Chowdhury
Never attribute to malice that which can be attributed to stupidity -- Hanlon's Razor
408-597-7068 (US) / 07875-841-046 (UK) / 0189-245544 (BD)
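
The scenario in the original question (read 10 bytes, replace the 2nd, write them back) can be sketched in Python 3 roughly like this. The helper name `replace_second_byte`, the file names, and the use of a temporary directory are assumptions made purely for illustration; no encoding appears anywhere because the files are opened in binary mode.

```python
import os
import tempfile


def replace_second_byte(src_path, dst_path, new_value):
    """Copy src to dst, replacing the byte at index 1; nothing is decoded."""
    with open(src_path, "rb") as f:   # binary mode: raw bytes in
        data = bytearray(f.read())
    data[1] = new_value               # must be an int in range(256)
    with open(dst_path, "wb") as f:   # binary mode: raw bytes out
        f.write(data)


workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.bin")
dst = os.path.join(workdir, "out.bin")
with open(src, "wb") as f:
    f.write(bytes(range(10)))         # a 10-byte file, as in the question

replace_second_byte(src, dst, 0xFF)
with open(dst, "rb") as f:
    copied = f.read()
print(copied)
```

Whether byte index 1 corresponds to the second *character* of the file is a separate question, and that is the point at which the encoding would have to be known.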
Re: (Simple?) Unicode Question
On Thu, 2009-08-27 at 22:09 +0530, Shashank Singh wrote:

> Hi All!
>
> I have a very simple (and probably stupid) question eluding me. When exactly is the char-set information needed?
>
> To make my question clear, consider reading a file. While reading a file, all I get is basically an array of bytes.
>
> Now suppose a file has 10 bytes in it (all is data, no metadata, forget the BOM and stuff for a little while). I read it into an array of 10 bytes, replace, say, the 2nd byte and write all the bytes back to a new file. Do I need the character encoding mumbo jumbo anywhere in this?
>
> Further, does anything except a printing device need to know the encoding of a piece of text?
>
> I mean, as long as we are not trying to get a symbolic representation of a text or get the ith character of it, all we need to do is to carry the intended encoding as auxiliary information to the data stored as a byte array.

If you are just reading and writing bytes then you are just reading and writing bytes. Where you need to worry about unicode, etc. is when you start treating a series of bytes as TEXT (e.g. how many *characters* are in this byte array).*

This is no different, IMO, than treating a byte stream vs. an image file. You don't need to worry about resolution, palette, bit-depth, etc. if you are only treating it as a stream of bytes. The only difference between the two is that in Python unicode is a built-in type and image isn't ;)

* Just make sure that if you are manipulating byte streams independently of their textual representation, you open the files in binary mode.

-a
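
As a small illustration of the "how many characters are in this byte array" point, here is a Python 3 sketch. The sample word and the assumption that the bytes are UTF-8 are mine, not from the thread.

```python
# Illustrative data: the UTF-8 bytes of a 4-character word starting with
# a u-umlaut (U+00FC), as they might be read from a file in binary mode.
data = "\u00fcber".encode("utf-8")

byte_count = len(data)        # counting bytes needs no encoding
text = data.decode("utf-8")   # treating the bytes as text does
char_count = len(text)

first_byte = data[:1]         # only half of the first character
first_char = text[0]          # the whole first character

print(byte_count, char_count)  # 5 4
print(first_byte)              # b'\xc3'
```

The byte count and the character count differ as soon as any character takes more than one byte, which is why indexing the n'th character requires decoding first.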