Re: getting rid of —

MRAB Fri, 03 Jul 2009 09:57:02 -0700

Tep wrote:

On 3 Jul., 16:58, "Mark Tolonen" <[email protected]> wrote:

"Tep" <[email protected]> wrote in message


news:[email protected]...

On 3 Jul., 06:40, Simon Forman <[email protected]> wrote:

On Jul 2, 4:31 am, Tep <[email protected]> wrote:

[snip]

How can I remove the '—' sign from a string, or split at that
character?
I get a Unicode error if I try to do it:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 1: ordinal not in range(128)
Thanks, Pet
The script declares # -*- coding: UTF-8 -*-

[snip]

I just tried a bit of your code above in my interpreter here and it
worked fine:
|>>> data = 'foo — bar'
|>>> data.split('—')
|['foo ', ' bar']
|>>> data = u'foo — bar'
|>>> data.split(u'—')
|[u'foo ', u' bar']
Figure out the smallest piece of "html source code" that causes the
problem and include that with your next post.

The problem was that I had converted the "html source code" to a unicode
object and didn't encode it back to UTF-8 before using split...
Thanks for the help, and sorry for the not-so-smart question.
Pet

You'd still benefit from posting some code.  You shouldn't be converting
back to utf-8 to do a split; you should be using a Unicode string with split
on the Unicode version of the "html source code".  Also make sure your file
is actually saved in the encoding you declare.  I print the encoding of your
symbol in two encodings to illustrate why I suspect this.

I've posted code below.

The file was indeed in windows-1252; I've changed this. For the errors, see
below.

Below, assume "data" is your "html source code" as a Unicode string:

# -*- coding: UTF-8 -*-
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')

OUTPUT:

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Note that using the Unicode string in split() works.  Also note the decode
byte in the error message when using a non-Unicode string to split the
Unicode data.  In your original error message the decode byte that caused an
error was 0x97, which is 'EM DASH' in Windows-1252 encoding.  Make sure to
save your source code in the encoding you declare.  If I save the above
script in windows-1252 encoding and change the coding line to windows-1252 I
get the same results, but the decode byte is 0x97.

# coding: windows-1252
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0: ordinal not in range(128)
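
For anyone reading this on Python 3 (an assumption; every script in this thread is Python 2), the two-encodings point can be sketched roughly as follows. It only shows what the em dash looks like as bytes in each encoding and what happens when the bytes are read back with the wrong codec. The variable names are made up for illustration.

```python
# Python 3 sketch (the thread itself uses Python 2).
EM_DASH = "\u2014"  # the character being split on in this thread

# The same character is stored as different bytes in each encoding.
utf8_bytes = EM_DASH.encode("utf-8")
cp1252_bytes = EM_DASH.encode("windows-1252")
print(utf8_bytes)    # b'\xe2\x80\x94'
print(cp1252_bytes)  # b'\x97'

# Reading UTF-8 bytes as Windows-1252 does not raise; it silently
# produces three unrelated characters (mojibake).
mojibake = utf8_bytes.decode("windows-1252")

# Reading the Windows-1252 byte as UTF-8 does raise, because 0x97
# cannot start a UTF-8 sequence.
try:
    cp1252_bytes.decode("utf-8")
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
```

This is why the declared coding line and the encoding the file is actually saved in have to agree: the interpreter trusts the declaration when it turns the bytes of a literal into characters.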

-Mark


#! /usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import re
def getTitle(input):
    title = re.search('<title>(.*?)</title>', input)


The input is Unicode, so it's probably better for the regular expression
to also be Unicode:

    title = re.search(u'<title>(.*?)</title>', input)

(In the current implementation it actually doesn't matter.)
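
As a side note for Python 3 readers (an assumption; the code in this thread is Python 2, where mixing happens to work here), the pattern and the text being searched must be the same kind of string, so the advice above is enforced rather than optional. A minimal sketch, with a made-up sample page:

```python
import re

# Hypothetical sample input, already decoded to text.
page = "<title>foo \u2014 bar</title>"

# Text pattern on text input: works.
match = re.search(r"<title>(.*?)</title>", page)
title = match.group(1)

# Bytes pattern on text input: Python 3 refuses with a TypeError.
try:
    re.search(rb"<title>(.*?)</title>", page)
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(mixing_allowed)  # False
```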

    title = title.group(1)
    print "FULL TITLE", title.encode('UTF-8')
    parts = title.split(' — ')


The title is Unicode, so the string with which you're splitting should
also be Unicode:

    parts = title.split(u' — ')

    return parts[0]


def getWebPage(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    req = urllib2.Request(url, '', headers)
    response = urllib2.urlopen(req)
    the_page = unicode(response.read(), 'UTF-8')
    return the_page


def main():
    url = "http://bg.wikipedia.org/wiki/%D0%91%D0%B0%D1%85%D1%80%D0%B5%D0%B9%D0%BD"
    title = getTitle(getWebPage(url))
    print title[0]


if __name__ == "__main__":
    main()
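
Putting the two suggestions together, a minimal Python 3 sketch of the title extraction might look like the following. It is not the poster's script: the download step is left out so it can run offline, and `get_title`, its `encoding` parameter and the sample bytes are hypothetical names chosen for illustration.

```python
import re

def get_title(page_bytes, encoding="utf-8"):
    """Return the part of <title> before ' \u2014 ', or None if no title is found."""
    # Decode once, at the boundary, then work with text only.
    text = page_bytes.decode(encoding)
    match = re.search(r"<title>(.*?)</title>", text, re.DOTALL)
    if match is None:
        return None
    # The separator is text too, so no implicit conversion is needed.
    return match.group(1).split(" \u2014 ")[0]

# Stand-in for the bytes a web server would send (assumed to be UTF-8).
sample = "<title>Бахрейн \u2014 Уикипедия</title>".encode("utf-8")
first_part = get_title(sample)
```

Note that getTitle in the script above already returns parts[0], so `print title[0]` in main() would show only the first character of that string; the sketch returns the whole first part.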


Traceback (most recent call last):
  File "C:\user\Projects\test\src\new_main.py", line 29, in <module>
    main()
  File "C:\user\Projects\test\src\new_main.py", line 24, in main
    title = getTitle(getWebPage(url))
FULL TITLE Ð‘Ð°Ñ…Ñ€ÐµÐ¹Ð½ â€” Ð£Ð¸ÐºÐ¸Ð¿ÐµÐ´Ð¸Ñ�
  File "C:\user\Projects\test\src\new_main.py", line 9, in getTitle
    parts = title.split(' â€” ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)
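
The byte and position in these errors can be reproduced by hand. When Python 2 mixes a byte string with a Unicode string, it first decodes the byte string with the default 'ascii' codec, and that decode is what fails. The sketch below imitates that step explicitly in Python 3 (an assumption; Python 3 itself never does this conversion implicitly). `first_non_ascii` is a hypothetical helper name.

```python
def first_non_ascii(data):
    """Return (byte as hex, position) of the first byte ASCII cannot decode, else None."""
    try:
        data.decode("ascii")  # the step Python 2 performs implicitly
    except UnicodeDecodeError as exc:
        return hex(exc.object[exc.start]), exc.start
    return None

# ' \u2014 ' saved as UTF-8: the leading space decodes, then 0xe2 fails at position 1.
utf8_result = first_non_ascii(" \u2014 ".encode("utf-8"))
# A bare em dash saved as Windows-1252: 0x97 fails at position 0.
cp1252_result = first_non_ascii("\u2014".encode("windows-1252"))

print(utf8_result)    # ('0xe2', 1)
print(cp1252_result)  # ('0x97', 0)
```

So the byte value in the message tells you how the source file was really saved (0xe2 for UTF-8, 0x97 for Windows-1252), and the position reflects any ASCII characters before the dash in the separator.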

--
http://mail.python.org/mailman/listinfo/python-list
