Re: How do I automate the removal of all non-ascii characters from my code?

Vlastimil Brom Tue, 13 Sep 2011 11:17:06 -0700

2011/9/13 Alec Taylor <[email protected]>:
> Hmm, nothing mentioned so far works for me...
>
> Here's a very small test case:
>
>>>> python -u "Convert to Creole.py"
>  File "Convert to Creole.py", line 1
> SyntaxError: Non-ASCII character '\xe2' in file Convert to Creole.py
> on line 1, but no encoding declared; see
> http://www.python.org/peps/pep-0263.html for details
>>>> Exit Code: 1
>
> Line 1: a=u'''≤'''.encode("ascii", "ignore").decode("ascii")
>
> On Tue, Sep 13, 2011 at 11:33 PM, Vlastimil Brom
> <[email protected]> wrote:
>> 2011/9/13 ron <[email protected]>:
>>>
>>> Depending on the load, you can do something like:
>>>
>>> "".join([x for x in string if ord(x) < 128])
>>>
>>> It's worked great for me in cleaning input on webapps where there's a
>>> lot of copy/paste from varied sources.
>>> --
>>> http://mail.python.org/mailman/listinfo/python-list
>>>
>> Well, for this kind of dirty "data cleaning" you may as well use e.g.
>>
>> >>> u"äteöxt ÛÜÝ wiÉÊËÌthÞßà áânoûüýþn ASɔɕɖCɗɘəɚɛIɗɘəɚɛIεζ iηθιn жзbetийклweeჟრსn .ტუ..ფ".encode("ascii", "ignore").decode("ascii")
>> u'text  with non ASCII in between ...'
>> >>>
>>
>> vbr
>> --
>> http://mail.python.org/mailman/listinfo/python-list
>>
>


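Ok, in that case the .encode() call never gets to run, because the
SyntaxError is raised while the file is still being parsed. The file
encoding is probably utf-8; \xe2 is just the first byte of the encoded
character: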

>>> u'≤'.encode("utf-8")
'\xe2\x89\xa4'
>>>

Declaring this encoding at the beginning of the file, as mentioned
before (a "# -*- coding: utf-8 -*-" comment on the first or second
line, see PEP 263), might solve the problem while retaining the symbol
in question (or you could move from the syntax error to some
Unicode-related error, depending on other circumstances...).
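
If you would rather drop the symbol than declare an encoding, the
cleanup has to happen from a separate script, since the offending file
cannot be parsed as it stands. A rough sketch of that idea follows
(written for Python 3; the names strip_non_ascii and clean_source_file
are made up for illustration, and utf-8 as the file encoding is only
my guess from the \xe2 byte):

```python
import os
import tempfile


def strip_non_ascii(text):
    """Return text with every character above code point 127 dropped."""
    return text.encode("ascii", "ignore").decode("ascii")


def clean_source_file(path, encoding="utf-8"):
    """Rewrite the file at path in place without its non-ASCII characters.

    The utf-8 default is an assumption, not something the file declares.
    Returns the number of characters that were removed.
    """
    # Binary mode avoids any newline translation while reading and writing.
    with open(path, "rb") as handle:
        original = handle.read().decode(encoding)
    cleaned = strip_non_ascii(original)
    with open(path, "wb") as handle:
        handle.write(cleaned.encode("ascii"))
    return len(original) - len(cleaned)


# Demo on a throwaway file that mimics the failing one-line script.
demo_source = "a = u'''\u2264'''\n"
with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
    tmp.write(demo_source.encode("utf-8"))
    demo_path = tmp.name

removed = clean_source_file(demo_path)
with open(demo_path, "rb") as handle:
    result = handle.read()
os.remove(demo_path)
print(removed, result)
```

Note that this overwrites the file, so try it on a copy first; it also
silently changes any string literal that contained such characters.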

vbr
-- 
http://mail.python.org/mailman/listinfo/python-list
