Re: problems with Â character

2005-03-23 Thread John Machin
On Tue, 22 Mar 2005 21:39:30 -, "Claudio Grondi"
<[EMAIL PROTECTED]> wrote:


>In my ASCII table 'Â' is '\xC2'

You've got an *ASCII* table that includes that??

I hope you paid for it in Confederate dollars or czarist roubles --
that's about what such a table would be worth.



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: problems with Â character

2005-03-23 Thread jdonnell
Thanks everyone, I got it working earlier this morning using deelan's
suggestion. I modified the code in his link so that it removes rather
than replaces the characters.

Also, this was my first experience with unicode, and what confused me is
that I was thinking of a unicode object as an encoding, but it's not.
It's just a series of characters, and you later encode it to bytes with a
specific encoding like utf-8 or latin-1. Thanks again for all the help.
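
As a rough illustration of that distinction, here is a minimal sketch. It assumes Python 3 syntax (the thread itself uses Python 2, where the two types are called unicode and str), and the sample text is made up from the strings discussed in this thread.

```python
# Python 3 sketch: str holds text (code points), bytes holds encoded data.
text = "extensions \u00a0 \u00bb Good"   # NO-BREAK SPACE and » as code points

utf8_bytes = text.encode("utf-8")        # the same text as UTF-8 bytes
latin1_bytes = text.encode("latin-1")    # the same text as Latin-1 bytes

print(utf8_bytes)     # contains \xc2\xa0 and \xc2\xbb
print(latin1_bytes)   # contains \xa0 and \xbb

# Decoding with the matching encoding gives back the original text.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text
```

The text is the same in both cases; only the byte representation differs, which is why the encoding has to be named whenever text is turned into bytes or back.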



Re: Re: problems with Â character

2005-03-23 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, jdonnell wrote:

> Thanks for all the replies. I just got in to work so I haven't tried
> any of them yet. I see that I wasn't as clear as I should have been so
> I'll clarify a little. I'm grabbing some data from msn's rss feed.
> Here's an example.
> http://search.msn.com/results.aspx?q=domain+name&format=rss&FORM=ZZRE

Then you are getting UTF-8 encoded strings.

> The string ' all domain name extensions » Good' is where I have a
> problem. The
> '»' shows up as 'Â  Â  Â»' when I write it to a file or stick
> it in mysql. I did a hex dump and this is what I see.
> 
> [EMAIL PROTECTED]:~/scripts> cat test.txt
> extensions  Good
> [EMAIL PROTECTED]:~/scripts> xxd test.txt
> 000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0  extensions .. ..
> 010: 20c2 bb20 476f 6f64 0a.. Good
> 
> One thing that jumps out is that two of the Â's are c2a0, but one of
> them is c2bb. Well, those are the details since I wasn't clear before.

Those are two no-break spaces and a '»' character::

  In [42]: import unicodedata

  In [43]: unicodedata.name('\xc2\xa0'.decode('utf-8'))
  Out[43]: 'NO-BREAK SPACE'

  In [44]: unicodedata.name('\xc2\xbb'.decode('utf-8'))
  Out[44]: 'RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK'
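
The same check can be run over the whole hex dump at once. This is only a sketch, assuming Python 3 and assuming the file is UTF-8 encoded; the hex string is copied from the xxd output quoted above.

```python
import unicodedata

# Bytes copied from the xxd dump above (assumed to be UTF-8 encoded text).
raw = bytes.fromhex("657874656e73696f6e7320c2a020c2a020c2bb20476f6f640a")
text = raw.decode("utf-8")

# Collect the Unicode name of every non-ASCII character in the text.
names = [unicodedata.name(ch) for ch in text if ord(ch) > 127]
for name in names:
    print(name)
```

Each c2a0 pair decodes to one no-break space and the c2bb pair to one '»', so no separate 'Â' character is stored in the file.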

Ciao,
Marc 'BlackJack' Rintsch


Re: problems with Â character

2005-03-23 Thread jdonnell
Thanks for all the replies. I just got in to work so I haven't tried
any of them yet. I see that I wasn't as clear as I should have been so
I'll clarify a little. I'm grabbing some data from msn's rss feed.
Here's an example.
http://search.msn.com/results.aspx?q=domain+name&format=rss&FORM=ZZRE

The string ' all domain name extensions » Good' is where I have a
problem. The
'»' shows up as 'Â  Â  Â»' when I write it to a file or stick
it in mysql. I did a hex dump and this is what I see.

[EMAIL PROTECTED]:~/scripts> cat test.txt
extensions » Good
[EMAIL PROTECTED]:~/scripts> xxd test.txt
000: 6578 7465 6e73 696f 6e73 20c2 a020 c2a0  extensions .. ..
010: 20c2 bb20 476f 6f64 0a.. Good

One thing that jumps out is that two of the Â's are c2a0, but one of
them is c2bb. Well, those are the details since I wasn't clear before.



Re: Re: problems with Â character

2005-03-22 Thread Bengt Richter
On Tue, 22 Mar 2005 20:09:55 -0600, "John Roth" <[EMAIL PROTECTED]> wrote:

>I had this problem recently. It turned out that something
>had encoded a unicode string into utf-8. When I found
>the culprit and fixed the underlying design issue, it went away.
>
>John Roth
>
>
>
>"jdonnell" <[EMAIL PROTECTED]> wrote in message 
>news:[EMAIL PROTECTED]
>I have a mysql database with characters like Â  Â  Â» in it. I'm
>trying to write a python script to remove these, but I'm having a
>really hard time.
>
>These strings are coming out as type 'str' not 'unicode' so I tried to
>just
>
>record[4].replace('Â', '')
>
>but this does nothing. However the following code works
>
>#!/usr/bin/python
>
>s = 'a  aaa'
>print type(s)
>print s
>print s.find('Â')
>
>This returns
>
><type 'str'>
>a  aaa
>6
>
>The other odd thing is that the Â character shows up as two spaces if
>I print it to the terminal from mysql, but it shows up as Â when I
>print from the simple script above.
>What am I doing wrong?
>
What encodings are involved? 

This is from idle on windows, which seems to display latin-1 source ok:
 
 >>> "Latin-1:Â»\n".decode('latin-1')
 u'Latin-1:\xc2\xbb\n'
 >>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'replace')
 'Latin-1:?\xaf\n'
 >>> "Latin-1:Â»\n".decode('latin-1').encode('cp437', 'ignore')
 'Latin-1:\xaf\n'
 >>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
 'Latin-1:?\xaf\n'
 >>> 
 
Now this is in an NT4 console window with code page 437:

 
 >>> u'Latin-1:\xc2\xbb\n'.encode('cp437','replace')
 'Latin-1:?\xaf\n'
 >>> import sys
 >>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','replace'))
 Latin-1:?»
 

Notice that the interactive output does a repr that creates the \xaf, but
the character is available and can be written non-repr'd via sys.stdout.write.

For the heck of it:

 >>> sys.stdout.write(u'Latin-1:\xc2\xbb\n'.encode('cp437','xmlcharrefreplace'))
 Latin-1:&#194;»

I don't know if this is going to get through to your screen ;-)
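
For readers on a current interpreter, the three error handlers used above can be compared side by side. This is a sketch in Python 3 syntax (an assumption; the session above is Python 2), using the same two-character string 'Â»'.

```python
# 'Â' (U+00C2) has no slot in code page 437, while '»' (U+00BB) maps to 0xAF.
text = "Latin-1:\u00c2\u00bb\n"

results = {}
for handler in ("replace", "ignore", "xmlcharrefreplace"):
    # Each handler decides what happens to the unencodable 'Â'.
    results[handler] = text.encode("cp437", handler)
    print(handler, results[handler])
```

With the default 'strict' handler the same call raises UnicodeEncodeError instead, because 'Â' cannot be represented in cp437.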

Regards,
Bengt Richter


Re: problems with Â character

2005-03-22 Thread John Roth
I had this problem recently. It turned out that something
had encoded a unicode string into utf-8. When I found
the culprit and fixed the underlying design issue, it went away.

John Roth

"jdonnell" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
I have a mysql database with characters like Â  Â  Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.

These strings are coming out as type 'str' not 'unicode' so I tried to
just

record[4].replace('Â', '')

but this does nothing. However the following code works

#!/usr/bin/python

s = 'a  aaa'
print type(s)
print s
print s.find('Â')

This returns

<type 'str'>
a  aaa
6

The other odd thing is that the Â character shows up as two spaces if
I print it to the terminal from mysql, but it shows up as Â when I
print from the simple script above.
What am I doing wrong?


Re: problems with Â character

2005-03-22 Thread Marc 'BlackJack' Rintsch
In <[EMAIL PROTECTED]>, jdonnell wrote:

> I have a mysql database with characters like Â  Â  Â» in it. I'm
> trying to write a python script to remove these, but I'm having a
> really hard time.
>
> [...]
>
> The other odd thing is that the Â character shows up as two spaces if
> I print it to the terminal from mysql, but it shows up as Â when I
> print from the simple script above. 
> What am I doing wrong?

Is it possible that your DB stores strings UTF-8 encoded?  The
byte sequence '\xc2\xa0', which displays as 'Â ' when read as latin-1, is a
no-break space character in UTF-8.
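
A small sketch of that effect, in Python 3 syntax (an assumption; in Python 2 the literal would be a plain str). It decodes the same two bytes both ways and then reverses the wrong step.

```python
raw = b"\xc2\xa0"                 # the two bytes as stored (assumed UTF-8)

wrong = raw.decode("latin-1")     # two characters: 'Â' plus a no-break space
right = raw.decode("utf-8")       # one character: a no-break space

# Text that was decoded with the wrong codec can usually be repaired
# by undoing that step and decoding again with the right one.
repaired = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong), repr(right), repr(repaired))
```

The repair only works while the mis-decoded text is still intact; once the stray characters have been stripped or replaced, the original bytes cannot be recovered this way.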

Ciao,
Marc 'BlackJack' Rintsch



Re: problems with Â character

2005-03-22 Thread Do Re Mi chel La Si Do
And this runs OK for me:

s = 'a  aaa'
print s
print s.replace('Â', '')





Re: problems with Â character

2005-03-22 Thread Do Re Mi chel La Si Do
a  aaa'
0123456


It's OK





Re: problems with Â character

2005-03-22 Thread Claudio Grondi

"Claudio Grondi" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> >>s = 'a  aaa'
> >>What am I doing wrong?
>
> First get rid of characters not allowed
> in Python code.
> Replace Â with appropriate escape
> sequence: /x## (should be \x##) where ## is the
> hexadecimal code of the ASCII
> character.
>
> Claudio

i.e. probably instead of 'a  aaa'
'a \xC2 aaa'
In my ASCII table 'Â' is '\xC2'

Claudio




Re: problems with Â character

2005-03-22 Thread Claudio Grondi
>>s = 'a  aaa'
>>What am I doing wrong?

First get rid of characters not allowed
in Python code.
Replace Â with appropriate escape
sequence: /x## where ##  is the
hexadecimal code of the ASCII
character.

Claudio




Re: problems with Â character

2005-03-22 Thread deelan
jdonnell wrote:
> I have a mysql database with characters like Â  Â  Â» in it. I'm
> trying to write a python script to remove these, but I'm having a
> really hard time.

use the "hammer" recipe. i'm using it to create URL-friendly
fragments from latin-1 album titles:

(check the last comment, "a cleaner solution",
for a better implementation).

it basically hammers down accented chars like à and Â
to the nearest ASCII representation.

since you receive string data as str objects from mysql,
first convert them to unicode with:

u = unicode('Â', 'latin-1')

then feed u to the hammer function (the fix_unicode at the
end).
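
The recipe itself is not reproduced in this thread, so the following is not its code. It is a minimal sketch of one common way to get a similar effect, assuming Python 3, assuming the data is UTF-8 as suggested elsewhere in the thread (pass encoding="latin-1" otherwise), and using a hypothetical helper name to_ascii.

```python
import unicodedata

def to_ascii(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes, then keep only what ASCII can hold."""
    text = raw.decode(encoding)
    # NFKD splits accented letters into base letter + combining mark and
    # turns a no-break space into an ordinary space.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything non-ASCII removes the combining marks and
    # characters such as '»' that have no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii(b"extensions \xc2\xa0 \xc2\xbb Good"))
print(to_ascii("\u00e0\u00c2".encode("utf-8")))
```

Characters without an ASCII counterpart are removed rather than replaced here, which matches the "remove" variant jdonnell describes; runs of spaces left behind may still need collapsing.
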
HTH,
deelan
--
"But it's nice to know that, in these merciless times, at least
one mystery survives: the age of Afef Jnifen." -- dagospia.com


problems with Â character

2005-03-22 Thread jdonnell
I have a mysql database with characters like Â  Â  Â» in it. I'm
trying to write a python script to remove these, but I'm having a
really hard time.

These strings are coming out as type 'str' not 'unicode' so I tried to
just

record[4].replace('Â', '')

but this does nothing. However the following code works

#!/usr/bin/python

s = 'a  aaa'
print type(s)
print s
print s.find('Â')

This returns

<type 'str'>
a  aaa
6

The other odd thing is that the Â character shows up as two spaces if
I print it to the terminal from mysql, but it shows up as Â when I
print from the simple script above. 
What am I doing wrong?
