Another (simple) unicode question

2009-10-29 Thread Rustom Mody
Construct http://construct.wikispaces.com/ is a kick-ass binary file
structurer (written by a 21 year old!)
I thought of trying to port it to python3 but it barfs on some unicode
related stuff (after running 2to3) which I am unable to wrap my head
around.

Can anyone direct me to what I should read to try to understand this?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Another (simple) unicode question

2009-10-29 Thread John Machin
On Oct 29, 10:02 pm, Rustom Mody rustompm...@gmail.com wrote:
 Construct http://construct.wikispaces.com/ is a kick-ass binary file
 structurer (written by a 21 year old!)
 I thought of trying to port it to python3 but it barfs on some unicode
 related stuff (after running 2to3) which I am unable to wrap my head
 around.

 Can anyone direct me to what I should read to try to understand this?

"unicode related stuff" is rather vague. Have you read the Python
Unicode HOWTO? Joel Spolsky's article?

http://www.amk.ca/python/howto/unicode
http://www.joelonsoftware.com/articles/Unicode.html

In any case, it's a debugging problem, isn't it? Could you possibly
consider telling us the error message, the traceback, a few lines of
the 3.x code around where the problem is, and the corresponding 2.x
lines? Are you using 3.1.1 and 2.6.4? Does your test work in 2.6?


Re: Another (simple) unicode question

2009-10-29 Thread Carl Banks
On Oct 29, 4:02 am, Rustom Mody rustompm...@gmail.com wrote:
 Construct http://construct.wikispaces.com/ is a kick-ass binary file
 structurer (written by a 21 year old!)
 I thought of trying to port it to python3 but it barfs on some unicode
 related stuff (after running 2to3) which I am unable to wrap my head
 around.

2to3 isn't a general Python 2 to Python 3 translator.  You can't pass
any old Python 2.x code through 2to3 and expect it to work.  Rather,
you have to write the Python 2.x code in a subset of Python that I
call "transitional dialect".  In order to port to Python 3 using 2to3,
you first have to port it to this transitional dialect.

If Unicode is the issue, one thing you should do is explicitly
classify all strings as binary or text in Python 2.x.  This means
changing str() to unicode() or bytes(), whichever is appropriate, and
changing "" to u"" or b"".
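
As a rough illustration of what "classify every string as binary or text" looks like once the port is done, here is a small sketch in Python 3 syntax. The header layout and variable names are made up for the example and are not taken from Construct.

```python
# Sketch: keep binary data and text in separate types and convert
# only at an explicit boundary. All names here are illustrative.
import struct

header = b"\x00\x05"        # binary: a 2-byte big-endian length field
payload = b"caf\xc3\xa9"    # binary: 5 bytes of UTF-8 as read from a file

(length,) = struct.unpack(">H", header)   # parse the binary header
text = payload.decode("utf-8")            # bytes -> text, encoding stated explicitly

print(length, len(payload), text, len(text))
```

The point of the split is that len(payload) counts bytes while len(text) counts characters, so mixing the two types silently is exactly the kind of thing that breaks after 2to3.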


Carl Banks


Re: Another (simple) unicode question

2009-10-29 Thread Scott David Daniels

John Machin wrote:

On Oct 29, 10:02 pm, Rustom Mody rustompm...@gmail.com wrote:...

I thought of trying to port it to python3 but it barfs on some unicode
related stuff (after running 2to3) which I am unable to wrap my head
around.

Can anyone direct me to what I should read to try to understand this?


to which John replied with some good links to start, and then:


In any case, it's a debugging problem, isn't it? Could you possibly
consider telling us the error message, the traceback, a few lines of
the 3.x code around where the problem is, and the corresponding 2.x
lines? Are you using 3.1.1 and 2.6.4? Does your test work in 2.6?


Also consider how 2to3 translates the problem section(s).

--Scott David Daniels
scott.dani...@acm.org


Re: a simple unicode question

2009-10-28 Thread Gabriel Genellina
On Wed, 28 Oct 2009 02:28:01 -0300, Chris Jones cjns1...@gmail.com wrote:

On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote:

Chris Jones wrote:

Best part of Unicode is that there are multiple encodings, right? ;-)

No, the best part about Unicode is there is no encoding!
Unicode does not define any encoding;


RFC 3629:
ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.


what it defines is code-points for  characters which is not related to
how characters are encoded in files or network transmission.


In other words, Unicode is not related to any encoding .. and yet the
UTF-8, UTF-16.. encoding forms are clearly related to Unicode.

How is that possible?


Start reading The Absolute Minimum Every Software Developer Absolutely,  
Positively Must Know About Unicode and Character Sets (No Excuses!), by  
Joel Spolsky.

http://www.joelonsoftware.com/articles/Unicode.html

--
Gabriel Genellina



Re: a simple unicode question

2009-10-28 Thread Tim Arnold
Chris Jones cjns1...@gmail.com wrote in message 
news:mailman.2149.1256707687.2807.python-l...@python.org...
 On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote:
 Chris Jones wrote:

 [..]

 Best part of Unicode is that there are multiple encodings, right? ;-)

 No, the best part about Unicode is there is no encoding!

 Unicode does not define any encoding;

 RFC 3629:

 ISO/IEC 10646 and Unicode define several encoding forms of their
 common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.

 what it defines is code-points for  characters which is not related to
 how characters are encoded in files or network transmission.

 In other words, Unicode is not related to any encoding .. and yet the
 UTF-8, UTF-16.. encoding forms are clearly related to Unicode.

 How is that possible?

 CJ

When I first saw it, my first thought was that the subject line was an
oxymoron.

--Tim Arnold




Re: a simple unicode question

2009-10-27 Thread Chris Jones
On Tue, Oct 27, 2009 at 06:21:11AM EDT, Lie Ryan wrote:
 Chris Jones wrote:

[..]

 Best part of Unicode is that there are multiple encodings, right? ;-)

 No, the best part about Unicode is there is no encoding!

 Unicode does not define any encoding; 

RFC 3629:

ISO/IEC 10646 and Unicode define several encoding forms of their
common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.

 what it defines is code-points for  characters which is not related to
 how characters are encoded in files or network transmission.

In other words, Unicode is not related to any encoding .. and yet the
UTF-8, UTF-16.. encoding forms are clearly related to Unicode.

How is that possible?

CJ


Re: a simple unicode question

2009-10-27 Thread Lie Ryan

Chris Jones wrote:

On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote:

[..]


Characters outside the 16-bit range aren't supported on all builds.
They won't be supported on most Windows builds, as Windows uses 16-bit
Unicode extensively:


I knew nothing about UTF-16 & friends before this thread.

Best part of Unicode is that there are multiple encodings, right? ;-)


No, the best part about Unicode is there is no encoding!

Unicode does not define any encoding; what it defines is code-points for 
characters which is not related to how characters are encoded in files 
or network transmission.
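
However one reads the standard's wording on this, the distinction between a code point and its encoded bytes is easy to see in practice. The following is a small sketch in Python 3 (not the Python 2 used elsewhere in this thread); the variable names are illustrative.

```python
# One code point, several byte representations (Python 3).
ch = "\N{DEGREE SIGN}"

code_point = ord(ch)                  # the number Unicode assigns: U+00B0
as_latin1 = ch.encode("iso-8859-1")   # 1 byte
as_utf8 = ch.encode("utf-8")          # 2 bytes
as_utf16 = ch.encode("utf-16-be")     # 2 bytes (one 16-bit unit)
as_utf32 = ch.encode("utf-32-be")     # 4 bytes

print(hex(code_point), as_latin1, as_utf8, as_utf16, as_utf32)
```

The character is the same in every case; only the bytes chosen to carry it differ.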



Re: a simple unicode question

2009-10-22 Thread Gabriel Genellina

On Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com wrote:


On Oct 21, 4:59 am, Bruno Desthuilliers bruno.
42.desthuilli...@websiteburo.invalid wrote:

beSTEfar wrote:
(snip)
  When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful
enough to make REs overkill. And for any complex enough parsing (any
recursive construct for example - think XML, HTML, any programming
language etc), REs are just NOT enough by themselves - you need a full
blown parser.


But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.


I don't think so. Nesting isn't the only problem. RE's cannot handle
comments, for example. And you must support unquoted attributes, single and
double quotes, any attribute ordering, empty tags, arbitrary whitespace...  
If you don't, you are not reading XML (or HTML), only a specific file  
format that resembles XML but actually isn't.


--
Gabriel Genellina



Re: a simple unicode question

2009-10-22 Thread Chris Jones
On Wed, Oct 21, 2009 at 12:35:11PM EDT, Nobody wrote:

[..]

 Characters outside the 16-bit range aren't supported on all builds.
 They won't be supported on most Windows builds, as Windows uses 16-bit
 Unicode extensively:

I knew nothing about UTF-16 & friends before this thread.

Best part of Unicode is that there are multiple encodings, right? ;-)

Moot point on xterm anyway, since you'd be hard put to it to find a
decent terminal font that covers anything outside the BMP.

   Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
   (Intel)] on win32
   >>> unichr(0x10000)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: unichr() arg not in range(0x10000) (narrow Python build)
 
 Note that narrow builds do understand names outside of the BMP, and
 generate surrogate pairs for them:
 
   >>> u'\N{LINEAR B SYLLABLE B008 A}'
   u'\U00010000'
   >>> len(_)
   2
 
 Whether or not using surrogates in this context is a good idea is open to
 debate. What's the advantage of a multi-wchar string over a multi-byte
 string?

I don't understand this last remark, but since I'm only a GNU/Linux
hobbyist, I guess it doesn't make much difference.

Thanks for the code snippet and comments.

CJ


Re: a simple unicode question

2009-10-22 Thread rurpy
On 10/22/2009 03:23 AM, Gabriel Genellina wrote:
 On Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com wrote:

 On Oct 21, 4:59 am, Bruno Desthuilliers bruno.
 42.desthuilli...@websiteburo.invalid wrote:
 beSTEfar wrote:
 (snip)
   When parsing strings, use Regular Expressions.

 And now you have _two_ problems <g>

 For some simple parsing problems, Python's string methods are powerful
 enough to make REs overkill. And for any complex enough parsing (any
 recursive construct for example - think XML, HTML, any programming
 language etc), REs are just NOT enough by themselves - you need a full
 blown parser.

 But keep in mind that many XML, HTML, etc parsing problems
 are restricted to a subset where you know the nesting depth
 is limited (often to 0 or 1), and for that large set of
 problems, RE's *are* enough.

 I don't think so. Nesting isn't the only problem. RE's cannot handle
 comments, for example. And you must support unquoted attributes, single and
 double quotes, any attribute ordering, empty tags, arbitrary whitespace...
 If you don't, you are not reading XML (or HTML), only a specific file
 format that resembles XML but actually isn't.

OK, then let me rephrase my point as: in the real world it is often
not necessary to parse XML in its full generality; parsing, as you
put it, "a specific file format that resembles XML" is all that is
really needed.


Re: a simple unicode question

2009-10-22 Thread Gabriel Genellina

On Thu, 22 Oct 2009 17:08:21 -0300, ru...@yahoo.com wrote:


On 10/22/2009 03:23 AM, Gabriel Genellina wrote:

On Wed, 21 Oct 2009 15:14:32 -0300, ru...@yahoo.com wrote:


On Oct 21, 4:59 am, Bruno Desthuilliers bruno.
42.desthuilli...@websiteburo.invalid wrote:

beSTEfar wrote:
(snip)
  When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful
enough to make REs overkill. And for any complex enough parsing (any
recursive construct for example - think XML, HTML, any programming
language etc), REs are just NOT enough by themselves - you need a full
blown parser.


But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.


I don't think so. Nesting isn't the only problem. RE's cannot handle
comments, for example. And you must support unquoted attributes, single and
double quotes, any attribute ordering, empty tags, arbitrary whitespace...
If you don't, you are not reading XML (or HTML), only a specific file
format that resembles XML but actually isn't.


OK, then let me rephrase my point as: in the real world it is often
not necessary to parse XML in its full generality; parsing, as you
put it, "a specific file format that resembles XML" is all that is
really needed.


Given that using a real XML parser like ElementTree is as easy as (or even  
easier than) building a regular expression, and more robust, and more  
likely to survive small changes in the input format, why use the worse  
solution?

RE's are good at solving some problems, but parsing XML isn't one of those.
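
To make the comparison concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree, written for Python 3. The document and its element and attribute names are invented for the example. It deliberately mixes quote styles, attribute order, whitespace, a comment and an empty element, which are the cases listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the commented-out point must be ignored, and the
# two real points differ in quoting, attribute order and whitespace.
doc = """<points>
  <!-- <point lat="0" lon="0"/> is commented out -->
  <point lat='48.22' lon="16.37"/>
  <point   lon="2.35"
           lat="48.86" ></point>
</points>"""

root = ET.fromstring(doc)
coords = [(float(p.get("lat")), float(p.get("lon"))) for p in root.iter("point")]
print(coords)
```

A regular expression that copes with all of these variations is possible for a fixed input format, but it tends to grow one special case at a time, whereas the parser handles them without extra code.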

--
Gabriel Genellina



Re: a simple unicode question

2009-10-21 Thread Mark Tolonen


George Trojan george.tro...@noaa.gov wrote in message 
news:hbktk6$8b...@news.nems.noaa.gov...

Thanks for all suggestions. It took me a while to find out how to
configure my keyboard to be able to type the degree sign. I prefer to
stick with pure ASCII if possible.
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
Is that the place to look?

George

Scott David Daniels wrote:

Mark Tolonen wrote:

Is there a better way of getting the degrees?


It seems your string is UTF-8.  \xc2\xb0 is UTF-8 for DEGREE SIGN.  If 
you type non-ASCII characters in source code, make sure to declare the 
encoding the file is *actually* saved in:


# coding: utf-8

s = '''48° 13' 16.80 N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80 N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)



Mark is right about the source, but you needn't write unicode source
to process unicode data.  Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:

IDLE 2.6.3
  >>> s = '''48\xc2\xb0 13' 16.80 N'''
  >>> q = s.decode('utf-8')
  >>> degrees, rest = q.split(u'\N{DEGREE SIGN}')
  >>> print degrees
48
  >>> print rest
 13' 16.80 N

And if you are unsure of the name to use:
  >>> import unicodedata
  >>> unicodedata.name(u'\xb0')
'DEGREE SIGN'


It wouldn't be your favorite way if you were typing Chinese:

x = u'我是美国人。'

vs.

x = u'\N{CJK UNIFIED IDEOGRAPH-6211}\N{CJK UNIFIED IDEOGRAPH-662F}\N{CJK 
UNIFIED IDEOGRAPH-7F8E}\N{CJK UNIFIED IDEOGRAPH-56FD}\N{CJK UNIFIED 
IDEOGRAPH-4EBA}\N{IDEOGRAPHIC FULL STOP}'


;^) Mark







Re: a simple unicode question

2009-10-21 Thread Scott David Daniels

George Trojan wrote:

Scott David Daniels wrote:

...

And if you are unsure of the name to use:
  >>> import unicodedata
  >>> unicodedata.name(u'\xb0')
'DEGREE SIGN'


 Thanks for all suggestions. It took me a while to find out how to
 configure my keyboard to be able to type the degree sign. I prefer to
 stick with pure ASCII if possible.
 Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
 http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
 Is that the place to look?

I thought the mention of unicodedata would make it clear.

 import sys, unicodedata
 for n in xrange(sys.maxunicode+1):
     try:
         nm = unicodedata.name(unichr(n))
     except ValueError:
         pass
     else:
         if 'tortoise' in nm.lower(): print n, nm
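
For readers on Python 3, the same search can be sketched as below. The helper name find_names is made up for this example; it relies only on unicodedata.name() with a default value, so unnamed code points are skipped without a try/except.

```python
import sys
import unicodedata

def find_names(fragment):
    """Return (code point, name) pairs whose Unicode name contains fragment."""
    fragment = fragment.upper()          # character names are upper case
    hits = []
    for n in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(n), "")   # "" for unnamed code points
        if fragment in name:
            hits.append((n, name))
    return hits

hits = find_names("degree")
for n, name in hits:
    print(hex(n), name)
```

Which names turn up depends on the Unicode database shipped with the interpreter, but DEGREE SIGN at 0xb0 should be among them on any version.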


--Scott David Daniels
scott.dani...@acm.org


Re: a simple unicode question

2009-10-21 Thread Chris Jones
On Wed, Oct 21, 2009 at 12:20:35AM EDT, Nobody wrote:
 On Tue, 20 Oct 2009 17:56:21 +, George Trojan wrote:

[..]

  Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? 
 
 You can get them from the unicodedata module, e.g.:
 
   import unicodedata
   for i in xrange(0x10000):
     n = unicodedata.name(unichr(i),None)
     if n is not None:
       print i, n

Python rocks!

Just curious, why did you choose to set the upper boundary at 0xFFFF?

CJ


Re: a simple unicode question

2009-10-21 Thread Bruno Desthuilliers

beSTEfar wrote:
(snip)
 When parsing strings, use Regular Expressions.

And now you have _two_ problems <g>

For some simple parsing problems, Python's string methods are powerful 
enough to make REs overkill. And for any complex enough parsing (any 
recursive construct for example - think XML, HTML, any programming 
language etc), REs are just NOT enough by themselves - you need a full 
blown parser.




Re: a simple unicode question

2009-10-21 Thread Nobody
On Wed, 21 Oct 2009 05:16:56 -0400, Chris Jones wrote:

  Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? 
 
 You can get them from the unicodedata module, e.g.:
 
  import unicodedata
  for i in xrange(0x10000):
    n = unicodedata.name(unichr(i),None)
    if n is not None:
      print i, n
 
 Python rocks!
 
 Just curious, why did you choose to set the upper boundary at 0xFFFF?

Characters outside the 16-bit range aren't supported on all builds. They
won't be supported on most Windows builds, as Windows uses 16-bit Unicode
extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Note that narrow builds do understand names outside of the BMP, and
generate surrogate pairs for them:

>>> u'\N{LINEAR B SYLLABLE B008 A}'
u'\U00010000'
>>> len(_)
2

Whether or not using surrogates in this context is a good idea is open to
debate. What's the advantage of a multi-wchar string over a multi-byte
string?
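
For anyone wondering where the two units come from, the split follows the standard UTF-16 rule and can be sketched in a few lines of Python 3. The function name surrogate_pair is illustrative, and the result is cross-checked against the interpreter's own utf-16-be codec.

```python
import struct

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need surrogates")
    offset = code_point - 0x10000          # 20 bits remain
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

ch = "\N{LINEAR B SYLLABLE B008 A}"        # U+10000
high, low = surrogate_pair(ord(ch))

# Cross-check: the codec should produce the same two 16-bit units.
units = struct.unpack(">2H", ch.encode("utf-16-be"))
print(hex(high), hex(low), units == (high, low))
```

On a narrow build, the two-element string shown in the session above holds exactly these two units.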



Re: a simple unicode question

2009-10-21 Thread rurpy
On Oct 21, 4:59 am, Bruno Desthuilliers bruno.
42.desthuilli...@websiteburo.invalid wrote:
 beSTEfar wrote:
 (snip)
   When parsing strings, use Regular Expressions.

 And now you have _two_ problems <g>

 For some simple parsing problems, Python's string methods are powerful
 enough to make REs overkill. And for any complex enough parsing (any
 recursive construct for example - think XML, HTML, any programming
 language etc), REs are just NOT enough by themselves - you need a full
 blown parser.

But keep in mind that many XML, HTML, etc parsing problems
are restricted to a subset where you know the nesting depth
is limited (often to 0 or 1), and for that large set of
problems, RE's *are* enough.


Re: a simple unicode question

2009-10-21 Thread Terry Reedy

Nobody wrote:


Just curious, why did you choose to set the upper boundary at 0xFFFF?


Characters outside the 16-bit range aren't supported on all builds. They
won't be supported on most Windows builds, as Windows uses 16-bit Unicode
extensively:

Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
>>> unichr(0x10000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)


In Python 3, if not 2.6, chr(0x10000) (what used to be unichr()) works
fine on Windows, and generates the appropriate surrogate pair.




Re: a simple unicode question

2009-10-20 Thread Scott David Daniels

Mark Tolonen wrote:

Is there a better way of getting the degrees?


It seems your string is UTF-8.  \xc2\xb0 is UTF-8 for DEGREE SIGN.  If 
you type non-ASCII characters in source code, make sure to declare the 
encoding the file is *actually* saved in:


# coding: utf-8

s = '''48° 13' 16.80 N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80 N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)



Mark is right about the source, but you needn't write unicode source
to process unicode data.  Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:

IDLE 2.6.3
>>> s = '''48\xc2\xb0 13' 16.80 N'''
>>> q = s.decode('utf-8')
>>> degrees, rest = q.split(u'\N{DEGREE SIGN}')
>>> print degrees
48
>>> print rest
 13' 16.80 N

And if you are unsure of the name to use:
>>> import unicodedata
>>> unicodedata.name(u'\xb0')
'DEGREE SIGN'

--Scott David Daniels
scott.dani...@acm.org


Re: a simple unicode question

2009-10-20 Thread George Trojan
Thanks for all suggestions. It took me a while to find out how to 
configure my keyboard to be able to type the degree sign. I prefer to 
stick with pure ASCII if possible.
Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found 
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt

Is that the place to look?

George

Scott David Daniels wrote:

Mark Tolonen wrote:

Is there a better way of getting the degrees?


It seems your string is UTF-8.  \xc2\xb0 is UTF-8 for DEGREE SIGN.  If 
you type non-ASCII characters in source code, make sure to declare the 
encoding the file is *actually* saved in:


# coding: utf-8

s = '''48° 13' 16.80 N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80 N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)



Mark is right about the source, but you needn't write unicode source
to process unicode data.  Since nobody else mentioned my favorite way
of writing unicode in ASCII, try:

IDLE 2.6.3
  >>> s = '''48\xc2\xb0 13' 16.80 N'''
  >>> q = s.decode('utf-8')
  >>> degrees, rest = q.split(u'\N{DEGREE SIGN}')
  >>> print degrees
48
  >>> print rest
 13' 16.80 N

And if you are unsure of the name to use:
  >>> import unicodedata
  >>> unicodedata.name(u'\xb0')
'DEGREE SIGN'

--Scott David Daniels
scott.dani...@acm.org



Re: a simple unicode question

2009-10-20 Thread Nobody
On Tue, 20 Oct 2009 17:56:21 +, George Trojan wrote:

 Thanks for all suggestions. It took me a while to find out how to 
 configure my keyboard to be able to type the degree sign. I prefer to 
 stick with pure ASCII if possible.
 Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found 
 http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
 Is that the place to look?

You can get them from the unicodedata module, e.g.:

import unicodedata
for i in xrange(0x10000):
  n = unicodedata.name(unichr(i),None)
  if n is not None:
    print i, n



Re: a simple unicode question

2009-10-20 Thread Martin v. Löwis
 Where are the literals (i.e. u'\N{DEGREE SIGN}') defined? I found
 http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
 Is that the place to look?

Correct - you are supposed to fill in a Unicode character name into
the \N escape. The specific list of names depends on the version of
the UCD which was used in the specific Python version, but the
characters you are likely interested in probably had been defined
forever.
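
A quick way to check both points from inside the interpreter (shown here as a Python 3 sketch) is the unicodedata module: unidata_version reports which UCD release the running Python was built with, and lookup() resolves the same names that the \N escape accepts.

```python
import unicodedata

ucd_version = unicodedata.unidata_version     # e.g. a string like "5.1.0"
degree = unicodedata.lookup("DEGREE SIGN")    # name -> character
same_as_escape = degree == "\N{DEGREE SIGN}"

print(ucd_version, hex(ord(degree)), same_as_escape)
```

lookup() raises KeyError for a name the bundled database does not know, which is how a too-new character name would show up on an older Python.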

Regards,
Martin


a simple unicode question

2009-10-19 Thread George Trojan
A trivial one, this is the first time I have to deal with Unicode. I am 
trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is 
iso-8859-1. To get the degrees I did

>>> encoding='iso-8859-1'
>>> q=s.decode(encoding)
>>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
>>> r=q.split()[0]
>>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?

George


Re: a simple unicode question

2009-10-19 Thread Diez B. Roggisch

George Trojan wrote:
A trivial one, this is the first time I have to deal with Unicode. I am 
trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is 
iso-8859-1. To get the degrees I did

  >>> encoding='iso-8859-1'
  >>> q=s.decode(encoding)
  >>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
  >>> r=q.split()[0]
  >>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?


Instead of this rather convoluted way to specify a degree-sign, better do

 # -*- coding: utf-8 -*-
 ...
 int(r[:r.find(u"°")])


Please note that the utf-8 coding declaration has *nothing* to do with
the encoding of your string; it only names the source-file encoding. Of
course your editor must actually save the file as utf-8. Or you can use
any other encoding you like, as long as the declaration matches it.


Diez


Re: a simple unicode question

2009-10-19 Thread beSTEfar
On 19 Oct, 21:07, George Trojan george.tro...@noaa.gov wrote:
 A trivial one, this is the first time I have to deal with Unicode. I am
 trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is
 iso-8859-1. To get the degrees I did
   >>> encoding='iso-8859-1'
   >>> q=s.decode(encoding)
   >>> q.split()
 [u'48\xc2\xb0', u"13'", u'16.80', u'N']
   >>> r=q.split()[0]
   >>> int(r[:r.find(unichr(ord('\xc2')))])
 48

 Is there a better way of getting the degrees?

 George

When parsing strings, use Regular Expressions. If you don't know how
to, spend some time teaching yourself how to - well spent time! A
great tool for playing around with REs is KODOS.

For the problem at hand you can do, e.g.:

  import re
  degrees = int(re.findall(r'\d+', s)[0])

that in essence will group together all groups of consecutive digits,
return the first group and int() it. No need to care/know about the
fact that the string is Unicode and the underlying coding of the
charset.
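
If all three fields are wanted rather than just the degrees, the same idea extends to one pattern. This is a sketch in Python 3; the pattern DMS and the helper parse_dms are made up for the example. It assumes the layout shown in the question and treats the seconds mark as optional. The degree symbol is matched only as "some non-digits", so it works whether or not the string was decoded correctly.

```python
import re

# degrees, separator, minutes', seconds (optional "), hemisphere letter
DMS = re.compile(r"(\d+)\D+(\d+)'\s*(\d+(?:\.\d+)?)\"?\s*([NSEW])")

def parse_dms(text):
    """Convert a degrees/minutes/seconds string to signed decimal degrees."""
    m = DMS.search(text)
    if m is None:
        raise ValueError("not a coordinate: %r" % (text,))
    deg, minutes, sec, hemi = m.groups()
    value = int(deg) + int(minutes) / 60 + float(sec) / 3600
    return -value if hemi in "SW" else value

print(parse_dms("48\N{DEGREE SIGN} 13' 16.80 N"))
```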


Re: a simple unicode question

2009-10-19 Thread Mark Tolonen


George Trojan george.tro...@noaa.gov wrote in message 
news:hbidd7$i9...@news.nems.noaa.gov...
A trivial one, this is the first time I have to deal with Unicode. I am 
trying to parse a string s='''48° 13' 16.80 N'''. I know the charset is 
iso-8859-1. To get the degrees I did

  >>> encoding='iso-8859-1'
  >>> q=s.decode(encoding)
  >>> q.split()
[u'48\xc2\xb0', u"13'", u'16.80', u'N']
  >>> r=q.split()[0]
  >>> int(r[:r.find(unichr(ord('\xc2')))])
48

Is there a better way of getting the degrees?


It seems your string is UTF-8.  \xc2\xb0 is UTF-8 for DEGREE SIGN.  If you 
type non-ASCII characters in source code, make sure to declare the encoding 
the file is *actually* saved in:


# coding: utf-8

s = '''48° 13' 16.80 N'''
q = s.decode('utf-8')

# next line equivalent to previous two
q = u'''48° 13' 16.80 N'''

# couple ways to find the degrees
print int(q[:q.find(u'°')])
import re
print re.search(ur'(\d+)°',q).group(1)
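
The diagnosis can be checked directly. The sketch below is Python 3, where the bytes/text split is explicit, and assumes the input bytes are those shown in the original output. It decodes the same bytes both ways and compares the first field.

```python
raw = b"48\xc2\xb0 13' 16.80 N"       # the bytes, assumed to be UTF-8

wrong = raw.decode("iso-8859-1")      # every byte becomes one character
right = raw.decode("utf-8")           # \xc2\xb0 becomes a single U+00B0

print(repr(wrong.split()[0]), len(wrong))
print(repr(right.split()[0]), len(right))
```

Decoding as iso-8859-1 never fails, because every byte value is a valid character there, which is why the wrong guess went unnoticed and produced the stray \xc2 in the original output.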

-Mark




Re: (Simple?) Unicode Question

2009-08-30 Thread Nobody
On Sun, 30 Aug 2009 02:36:49 +, Steven D'Aprano wrote:

 So long as your terminal has a sensible encoding, and you have a good
 quality font, you should be able to print any string you can create.
 
 UTF-8 isn't a particularly sensible encoding for terminals.
 
 Did I mention UTF-8?
 
 Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?

I don't think I've ever seen a terminal (whether an emulator running on a
PC or a hardware terminal) which supports anything like the entire Unicode
repertoire, along with right-to-left writing, complex scripts, etc. Even
support for double-width characters is uncommon.

If your terminal can't handle anything outside of ISO-8859-1, there isn't
any advantage to using UTF-8, and some disadvantages; e.g. a typical Unix
tty driver will delete the last *byte* from the input buffer when you
press backspace (Linux 2.6.* has the IUTF8 flag, but this is non-standard).
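
The backspace problem can be imitated without a terminal. This Python 3 sketch assumes a driver that drops exactly one byte per backspace, as described above, and shows what is left in the buffer.

```python
line = "48\N{DEGREE SIGN}".encode("utf-8")   # b"48\xc2\xb0"
after_backspace = line[:-1]                  # one byte removed, not one character

try:
    after_backspace.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False                            # a dangling lead byte remains

print(after_backspace, valid)
```

With a one-byte encoding such as ISO-8859-1, the same slice would have removed the whole degree sign and left valid text behind.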

Historically, terminal I/O has tended to revolve around unibyte encodings,
with everything except the endpoints being encoding-agnostic. Anything
which falls outside of that is a dog's breakfast; it's no coincidence
that the word for messed-up text (arising from an encoding mismatch)
was borrowed from Japanese (mojibake).

Life is simpler if you can use a unibyte encoding. Apart from anything
else, the failure modes tend to be harmless. E.g. you get the wrong glyph
rather than two glyphs where you expected one. On a 7-bit channel, you get
the wrong printable character rather than a control character (this is why
ISO-8859-* reserves \x80-\x9F as control codes rather than using them as
printable characters).

 And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts
 together and stuff them into a TTF file; that doesn't make them a
 font, though.
 
 I never mentioned "Unicode font" either. In any case, there's no reason 
 why a skillful designer can't make a single font which covers the entire 
 Unicode range in a consistent style.

Consistency between unrelated scripts is neither realistic nor
desirable.

E.g. Latin fonts tend to use uniform stroke widths unless they're
specifically designed to look like handwriting, whereas Han fonts tend to
prefer variable-width strokes which reflect the direction.

 The main advantage of using Unicode internally is that you can associate
 encodings with the specific points where data needs to be converted
 to/from bytes, rather than having to carry the encoding details around
 the program.
 
 Surely the main advantage of Unicode is that it gives you a full and 
 consistent range of characters not limited to the 128 characters provided 
 by ASCII?

Nothing stops you from using other encodings, or from using multiple
encodings. But using multiple encodings means keeping track of the
encodings. This isn't impossible, and it may produce better results (e.g.
no information loss from Han unification), but it can be a lot more work.



Re: (Simple?) Unicode Question

2009-08-29 Thread Thorsten Kampe
* Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700)
  Further, does anything, except a printing device need to know the
  encoding of a piece of text?

Python needs to know if you are processing the text.
 
 I may be wrong, but I believe that's part of the idea between separation  
 of string and bytes types in Python 3.x. I believe, if you are using  
 Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)

Nothing has changed in that regard. You still need to decode and encode 
text and for that you have to know the encoding.

Thorsten


Re: (Simple?) Unicode Question

2009-08-29 Thread Steven D'Aprano
On Sat, 29 Aug 2009 09:34:43 +0200, Thorsten Kampe wrote:

 * Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700)
  Further, does anything, except a printing device need to know the
  encoding of a piece of text?
 
 Python needs to know if you are processing the text.

Python only needs to know when you convert the text to or from bytes. I 
can do this:

>>> s = "hello"
>>> t = "world"
>>> print(' '.join([s, t]))
hello world

and not need to care anything about encodings.

So long as your terminal has a sensible encoding, and you have a good 
quality font, you should be able to print any string you can create.



 I may be wrong, but I believe that's part of the idea between
 separation of string and bytes types in Python 3.x. I believe, if you
 are using Python 3.x, you don't need the character encoding mumbo jumbo
 at all ;-)
 
 Nothing has changed in that regard. You still need to decode and encode
 text and for that you have to know the encoding.

You only need to worry about encoding when you convert from bytes to 
text, and vice versa. Admittedly, the most common time you need to do 
that is when reading input from files, but if all your text strings are 
generated by Python, and not output anywhere, you shouldn't need to care 
about encodings.

If all your text contains nothing but ASCII characters, you should never 
need to worry about encodings at all.
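
A rough sketch of the ASCII case, assuming Python 3 and assuming every
encoding involved is an ASCII superset (the sample string and the list of
encodings are illustrative choices):

```python
text = "hello world"   # illustrative ASCII-only sample

# Encodings assumed here to be ASCII-compatible supersets.
encodings = ["ascii", "utf-8", "latin-1", "cp1252"]
encoded = {name: text.encode(name) for name in encodings}

# For pure-ASCII text they all produce the same bytes.
all_same = len(set(encoded.values())) == 1
print(all_same)

# UTF-16 is not ASCII-compatible, so the bytes differ there.
differs_in_utf16 = text.encode("utf-16") != encoded["ascii"]
print(differs_in_utf16)
```

So the "never worry" shortcut holds as long as nothing in the pipeline uses
an encoding such as UTF-16.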


-- 
Steven


Re: (Simple?) Unicode Question

2009-08-29 Thread Nobody
On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote:

 Python only needs to know when you convert the text to or from bytes. I 
 can do this:
 
 >>> s = "hello"
 >>> t = "world"
 >>> print(' '.join([s, t]))
 hello world
 
 and not need to care anything about encodings.
 
 So long as your terminal has a sensible encoding, and you have a good 
 quality font, you should be able to print any string you can create.

UTF-8 isn't a particularly sensible encoding for terminals.

And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts
together and stuff them into a TTF file; that doesn't make them a font,
though.

 I may be wrong, but I believe that's part of the idea behind the
 separation of string and bytes types in Python 3.x. I believe, if you
 are using Python 3.x, you don't need the character encoding mumbo jumbo
 at all ;-)
 
 Nothing has changed in that regard. You still need to decode and encode
 text and for that you have to know the encoding.
 
 You only need to worry about encoding when you convert from bytes to 
 text, and vice versa. Admittedly, the most common time you need to do 
 that is when reading input from files, but if all your text strings are 
 generated by Python, and not output anywhere, you shouldn't need to care 
 about encodings.

Why would you generate text strings and not output them anywhere?

The main advantage of using Unicode internally is that you can associate
encodings with the specific points where data needs to be converted
to/from bytes, rather than having to carry the encoding details around the
program.
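
A minimal sketch of that pattern, assuming Python 3. The helper name
transcode_upper and the sample data are hypothetical, made up only for
illustration: the encodings appear at the two edges and nowhere in between.

```python
def transcode_upper(raw, in_encoding, out_encoding):
    """Hypothetical helper: decode on input, work on str, encode on output."""
    text = raw.decode(in_encoding)        # bytes -> str at the input boundary
    result = text.upper()                 # pure text processing, no encoding involved
    return result.encode(out_encoding)    # str -> bytes at the output boundary

# Illustrative data: Latin-1 bytes in, UTF-8 bytes out.
latin1_input = "na\u00efve".encode("latin-1")
utf8_output = transcode_upper(latin1_input, "latin-1", "utf-8")
print(utf8_output)
```

The middle step never needs to know which encoding the data arrived in or
will leave in.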



Re: (Simple?) Unicode Question

2009-08-29 Thread Steven D'Aprano
On Sat, 29 Aug 2009 20:09:12 +0100, Nobody wrote:

 On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote:
 
 Python only needs to know when you convert the text to or from bytes. I
 can do this:
 
 >>> s = "hello"
 >>> t = "world"
 >>> print(' '.join([s, t]))
 hello world
 
 and not need to care anything about encodings.
 
 So long as your terminal has a sensible encoding, and you have a good
 quality font, you should be able to print any string you can create.
 
 UTF-8 isn't a particularly sensible encoding for terminals.

Did I mention UTF-8?

Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?


 And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts
 together and stuff them into a TTF file; that doesn't make them a
 font, though.

I never mentioned "Unicode font" either. In any case, there's no reason 
why a skillful designer can't make a single font which covers the entire 
Unicode range in a consistent style.


 I may be wrong, but I believe that's part of the idea behind the
 separation of string and bytes types in Python 3.x. I believe, if you
 are using Python 3.x, you don't need the character encoding mumbo
 jumbo at all ;-)
 
 Nothing has changed in that regard. You still need to decode and
 encode text and for that you have to know the encoding.
 
 You only need to worry about encoding when you convert from bytes to
 text, and vice versa. Admittedly, the most common time you need to do
 that is when reading input from files, but if all your text strings are
 generated by Python, and not output anywhere, you shouldn't need to
 care about encodings.
 
 Why would you generate text strings and not output them anywhere?

Who knows? It doesn't matter -- the point is that you can if you want to. 
You only need to worry about encodings at input and output, therefore 
logically if you don't do I/O you can process strings all day long and 
never worry about encodings at all.


 The main advantage of using Unicode internally is that you can associate
 encodings with the specific points where data needs to be converted
 to/from bytes, rather than having to carry the encoding details around
 the program.

Surely the main advantage of Unicode is that it gives you a full and 
consistent range of characters not limited to the 128 characters provided 
by ASCII?



-- 
Steven


Re: (Simple?) Unicode Question

2009-08-27 Thread Rami Chowdhury

Further, does anything, except a printing device need to know the
encoding of a piece of text?


I may be wrong, but I believe that's part of the idea behind the separation
of string and bytes types in Python 3.x. I believe, if you are using
Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)


If you're using Python 2.x, though, I believe if you simply set the file  
opening mode to binary then data you read() should still be treated as an  
array of bytes, although you may encounter issues trying to access the  
n'th character.
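
To make the "n'th character" issue concrete, here is a small sketch. It
assumes Python 3 rather than 2.x, and the sample string is only an
illustration:

```python
text = "h\u00e9llo"                 # illustrative sample, 5 characters
raw = text.encode("utf-8")          # 6 bytes: the e-acute takes two

second_char = text[1]               # a one-character str
second_byte = raw[1]                # an int in Python 3 (a 1-char str in Python 2)

print(len(text), len(raw))
print(hex(second_byte))
```

Indexing the bytes gives the n'th byte, which is only half of the second
character here.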


Please do correct me if I'm wrong, anyone.

On Thu, 27 Aug 2009 09:39:06 -0700, Shashank Singh  
shashank.sunny.si...@gmail.com wrote:



Hi All!

I have a very simple (and probably stupid) question eluding me.
When exactly is the char-set information needed?

To make my question clear consider reading a file.
While reading a file, all I get is basically an array of bytes.

Now suppose a file has 10 bytes in it (all is data, no metadata,
forget the BOM and stuff for a little while). I read it into an array of
10 bytes, replace, say, the 2nd byte and write all the bytes back to a
new file.

Do I need the character encoding mumbo jumbo anywhere in this?

Further, does anything, except a printing device, need to know the
encoding of a piece of text? I mean, as long as we are not trying
to get a symbolic representation of a text or get the i-th character
of it, all we need to do is carry the intended encoding as
auxiliary information alongside the data stored as a byte array.

Right?

--shashank




--
Rami Chowdhury
Never attribute to malice that which can be attributed to stupidity --  
Hanlon's Razor

408-597-7068 (US) / 07875-841-046 (UK) / 0189-245544 (BD)


Re: (Simple?) Unicode Question

2009-08-27 Thread Albert Hopkins
On Thu, 2009-08-27 at 22:09 +0530, Shashank Singh wrote:
 Hi All!
 
 I have a very simple (and probably stupid) question eluding me.
 When exactly is the char-set information needed?
 
 To make my question clear consider reading a file.
 While reading a file, all I get is basically an array of bytes.
 
 Now suppose a file has 10 bytes in it (all is data, no metadata,
 forget the BOM and stuff for a little while). I read it into an array
 of 10 bytes, replace, say, the 2nd byte and write all the bytes back to
 a new file.
 
 Do I need the character encoding mumbo jumbo anywhere in this?
 
 Further, does anything, except a printing device need to know the
 encoding of a piece of text? I mean, as long as we are not trying
 to get a symbolic representation of a text or get ith character
 of it, all we need to do is to carry the intended encoding as
 an auxiliary information to the data stored as byte array.

If you are just reading and writing bytes then you are just reading and
writing bytes.  Where you need to worry about unicode, etc. is when you
start treating a series of bytes as TEXT (e.g. how many *characters* are
in this byte array).* 

This is no different, IMO, from treating an image file as a byte stream.
You don't need to worry about resolution, palette, bit depth, etc. if
you are only treating it as a stream of bytes.  The only difference between
the two is that in Python unicode is a built-in type and image
isn't ;)

* Just make sure that if you are manipulating byte streams independently
of their textual representation, you open the files in binary mode.
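
A minimal sketch of the 10-byte scenario from the question, assuming
Python 3. The helper name replace_byte, the temporary file names and the
file contents are all made up for illustration:

```python
import os
import tempfile

def replace_byte(src_path, dst_path, index, new_value):
    """Hypothetical helper: copy src to dst with one byte replaced."""
    with open(src_path, "rb") as src:      # binary mode: no decoding
        data = bytearray(src.read())
    data[index] = new_value                # a plain integer in range(256)
    with open(dst_path, "wb") as dst:      # binary mode: no encoding
        dst.write(data)

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.bin")
    dst_path = os.path.join(tmp, "out.bin")
    with open(src_path, "wb") as f:
        f.write(bytes(range(10)))          # a 10-byte file, contents arbitrary
    replace_byte(src_path, dst_path, 1, 0xFF)   # index 1 is the 2nd byte
    with open(dst_path, "rb") as f:
        result = f.read()

print(result)
```

No encoding name appears anywhere, because nothing here treats the bytes
as text.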

-a

