Re: getting rid of —

Mark Tolonen Fri, 03 Jul 2009 08:00:19 -0700

"Tep" <[email protected]> wrote in message news:[email protected]...

On 3 Jul., 06:40, Simon Forman <[email protected]> wrote:
> On Jul 2, 4:31 am, Tep <[email protected]> wrote:

[snip]

> > > > > how can I replace '—' sign from string? Or do split at that
> > > > > character?
> > > > > Getting unicode error if I try to do it:
>
> > > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position
> > > > > 1: ordinal not in range(128)
>
> > > > > Thanks, Pet
>
> > > > > script is # -*- coding: UTF-8 -*-

[snip]

> I just tried a bit of your code above in my interpreter here and it
> worked fine:
>
> |>>> data = 'foo — bar'
> |>>> data.split('—')
> |['foo ', ' bar']
> |>>> data = u'foo — bar'
> |>>> data.split(u'—')
> |[u'foo ', u' bar']
>
> Figure out the smallest piece of "html source code" that causes the
> problem and include that with your next post.


The problem was that I'd converted the "html source code" to a unicode
object and didn't encode it back to utf-8 before using split...
Thanks for the help, and sorry for the not-so-smart question
Pet

You'd still benefit from posting some code. You shouldn't be converting
back to utf-8 to do a split; you should be using a Unicode string with split
on the Unicode version of the "html source code". Also make sure your file
is actually saved in the encoding you declare. I print the encoding of your
symbol in two encodings to illustrate why I suspect this.
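
As a rough sketch of that decode-once, split-in-Unicode approach (not your
actual code): raw_html and its UTF-8 encoding are made-up stand-ins for
however the page really arrives, and the em dash is written as an escape so
the example does not depend on how the file is saved.

```python
# -*- coding: utf-8 -*-
# Sketch only: raw_html and its encoding are assumptions, not the real page.
raw_html = b'foo \xe2\x80\x94 bar'        # bytes as they might arrive (UTF-8 assumed)

data = raw_html.decode('utf-8')           # decode once, at the boundary
parts = data.split(u'\u2014')             # split on the Unicode EM DASH (U+2014)
cleaned = data.replace(u'\u2014', u'-')   # or replace it instead of splitting

# Encode again only when writing the result back out.
out = u' | '.join(p.strip() for p in parts).encode('utf-8')
```

Everything between the decode and the final encode stays Unicode, so no
implicit ASCII conversion ever takes place.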


Below, assume "data" is your "html source code" as a Unicode string:

# -*- coding: UTF-8 -*-
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')


OUTPUT:

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
 File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
   exec codeObj in __main__.__dict__
 File "<auto import>", line 1, in <module>
 File "x.py", line 6, in <module>
   print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Note that using the Unicode string in split() works. Also note the byte
reported in the error message when a non-Unicode string is used to split the
Unicode data: Python first decodes the byte string with the default 'ascii'
codec, and the error names the first non-ASCII byte it meets. In your
original error message the byte that caused the error was 0x97, which is
'EM DASH' in Windows-1252 encoding. Make sure to save your source code in
the encoding you declare. If I save the above script in windows-1252
encoding and change the coding line to windows-1252, I get the same results,
but the decode byte is 0x97.


# coding: windows-1252
data = u'foo — bar'
print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')

'\xe2\x80\x94'
'\x97'
[u'foo ', u' bar']
Traceback (most recent call last):
 File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
   exec codeObj in __main__.__dict__
 File "<auto import>", line 1, in <module>
 File "x.py", line 6, in <module>
   print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0: ordinal not in range(128)
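
The byte named in each error can also be checked without re-saving the file.
The sketch below performs the ASCII decode explicitly and reports the byte
it fails on; first_bad_ascii_byte is a hypothetical helper written for this
illustration, not something from the standard library.

```python
EM_DASH = u'\u2014'   # written as an escape so the file encoding does not matter

def first_bad_ascii_byte(raw):
    """Return the value of the byte where an ASCII decode of raw fails, or None."""
    try:
        raw.decode('ascii')
    except UnicodeDecodeError as err:
        # err.start is the index of the first undecodable byte
        return bytearray(raw)[err.start]
    return None

utf8_bytes = EM_DASH.encode('utf-8')            # what a UTF-8 saved file contains
cp1252_bytes = EM_DASH.encode('windows-1252')   # what a Windows-1252 saved file contains

print(hex(first_bad_ascii_byte(utf8_bytes)))
print(hex(first_bad_ascii_byte(cp1252_bytes)))
```

If the byte in your traceback matches the second value rather than the
first, the file was most likely saved as Windows-1252 regardless of what
the coding line says.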


-Mark


--
http://mail.python.org/mailman/listinfo/python-list
