Re: [Tutor] converting encoded symbols from rss feed?

2009-06-19 Thread Serdar Tumgoren
> OK, so newline is unicode, outfile.write() wants a plain string. What
> encoding do you want outfile to be in? Try something like
> outfile.write(newline.encode('utf-8'))
> or use the codecs module to create an output that knows how to encode.

Aha!! The second of the two options above did the trick! It appears I
needed to open my "outfile" with utf-8 encoding. After that, I was
able to write out cleaned lines without any hitches.

Below is the working code. And of course, many thanks for the help!!


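import codecs

# strip_html() and translate_code() are defined earlier in the script.
infile = open('test.txt','rb')
#infile = codecs.open('test.txt','rb','utf-8')

# codecs.open() returns a file object that encodes unicode to utf-8 on write.
outfile = codecs.open('test_cleaned.txt','wb','utf-8')

for line in infile:
    cleanline = strip_html(translate_code(line)).strip()
    if cleanline:
        outline = cleanline + '\n'
        outfile.write(outline)
    else:
        continue

infile.close()
outfile.close()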
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-18 Thread Kent Johnson
On Thu, Jun 18, 2009 at 9:03 PM, Serdar Tumgoren wrote:

> When I run this code:
>
> <<< snip >>>
> for line in infile:
>    cleanline = translate_code(line)
>    newline = strip_html(cleanline)
>    outfile.write(newline)
> <<< snip >>>
>
> ...I receive the below traceback:
>
>   Traceback (most recent call last):
>      File "htmlcleanup.py", line 112, in <module>
>      outfile.write(newline)
>   UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in
> position 21: ordinal not in range(128)

OK, so newline is unicode, outfile.write() wants a plain string. What
encoding do you want outfile to be in? Try something like
outfile.write(newline.encode('utf-8'))
or use the codecs module to create an output that knows how to encode.
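
For anyone reading this in the archive, here is a minimal sketch of the two options side by side. It is written for Python 3 (the thread itself uses Python 2), and the names `line`, `raw1`, `raw2` and `writer` are made up for illustration; an in-memory byte stream stands in for the real output file.

```python
import codecs
import io

# A unicode string containing u'\xf1', the character from the traceback.
line = u"ma\xf1ana\n"

# Option 1: encode each string yourself before writing to a byte stream.
raw1 = io.BytesIO()
raw1.write(line.encode("utf-8"))

# Option 2: wrap the byte stream in a writer that does the encoding.
raw2 = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw2)
writer.write(line)

# Both approaches put the same bytes in the output.
print(raw1.getvalue() == raw2.getvalue())
```

Either way the output ends up as UTF-8 bytes; the second form just keeps the encoding decision in one place instead of at every write() call.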

Kent


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-18 Thread Serdar Tumgoren
Ok, I should say that I managed to "solve" the problem by first
reading and translating the data, and then applying Mr. Lundh's
strip_html function to the resulting lines.

For future reference (and of course any additional feedback), the
working code is here:

http://pastebin.com/f309bf607

But of course that's a Band-Aid approach and I'm still interested in
understanding the root of the problem. To that end, I've attached the
Exception below from the problematic code.

> Your try/except is hiding the problem. What happens if you take it
> out? What error do you get?
>
> My guess is that strip_html() is returning unicode and
> translate_code() is expecting strings but I'm not sure without seeing
> the error.
>

When I run this code:

<<< snip >>>
for line in infile:
    cleanline = translate_code(line)
    newline = strip_html(cleanline)
    outfile.write(newline)
<<< snip >>>

...I receive the below traceback:

   Traceback (most recent call last):
     File "htmlcleanup.py", line 112, in <module>
       outfile.write(newline)
   UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in
position 21: ordinal not in range(128)


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-18 Thread Kent Johnson
2009/6/18 Serdar Tumgoren:

>> In [7]: print x.encode('cp437')
>> --> print(x.encode('cp437'))
>> abc░
>>
> So does this mean that my python install is incapable of encoding the
> en/em dash?

No, the problem is with the print, not the encoding. Your console, as
configured, is incapable of displaying the em dash.

> But for some reason, I can't seem to get my translate_code function to
> work inside the same loop as Mr. Lundh's html cleanup code. Below is
> the problem code:
>
> infile = open('test.txt','rb')
> outfile = open('test_cleaned.txt','wb')
>
> for line in infile:
>    try:
>        newline = strip_html(line)
>        cleanline = translate_code(newline)
>        outfile.write(cleanline)
>    except:
>        newline = "NOT CLEANED: %s" % line
>        outfile.write(newline)
>
> infile.close()
> outfile.close()
>
> The strip_html function, documented here
> (http://effbot.org/zone/re-sub.htm#unescape-html ), returns a text
> string as far as I can tell. I'm confused why I wouldn't be able to
> further manipulate the string with the "translate_code" function and
> store the result in the "cleanline" variable. When I try this
> approach, none of the translations succeed and I'm left with the same
> HTML gook in the "outfile".

Your try/except is hiding the problem. What happens if you take it
out? What error do you get?

My guess is that strip_html() is returning unicode and
translate_code() is expecting strings but I'm not sure without seeing
the error.
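
As a rough illustration of why the bare except hides the problem, here is a minimal sketch (Python 3) that catches only the exception type it expects and keeps the error message instead of discarding it. `clean_lines` and `to_ascii` are hypothetical names; `to_ascii` merely stands in for the real strip_html()/translate_code() step.

```python
def clean_lines(lines, clean):
    """Apply clean() to each line and record failures instead of hiding them."""
    cleaned, errors = [], []
    for number, line in enumerate(lines, 1):
        try:
            cleaned.append(clean(line))
        except UnicodeError as exc:
            # Catch only the error we expect, and keep the details.
            errors.append((number, repr(exc)))
            cleaned.append("NOT CLEANED: %r" % line)
    return cleaned, errors


def to_ascii(text):
    # Hypothetical cleanup step that fails on non-ASCII input.
    return text.encode("ascii").decode("ascii")


cleaned, errors = clean_lines([u"plain", u"ma\xf1ana"], to_ascii)
for number, message in errors:
    print("line %d failed: %s" % (number, ascii(message)))
```

Any other exception type (a NameError from a typo, say) still propagates with a full traceback, which is usually what you want while debugging.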

Kent


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-18 Thread Serdar Tumgoren
> The example is written assuming the console encoding is utf-8. Yours
> seems to be cp437. Try this:

> In [1]: import sys
>
> In [2]: sys.stdout.encoding
> Out[2]: 'cp437'

That is indeed the result that I get as well.


> But there is another problem - \u2013 is an en dash which does not
> appear in cp437, so even giving the correct encoding doesn't work. Try
> this:
> In [6]: x = u"abc\u2591"
>
> In [7]: print x.encode('cp437')
> --> print(x.encode('cp437'))
> abc░
>
So does this mean that my python install is incapable of encoding the
en/em dash?

For the time being, I've gone with treating the symptom rather than
the root problem and created a translate function.

def translate_code(text):
    text = text.replace("&#145;","'")
    text = text.replace("&#146;","'")
    text = text.replace("&#147;",'"')
    text = text.replace("&#148;",'"')
    text = text.replace("&#150;","-")
    text = text.replace("&#151;","--")
    return text

Which of course has led to a new problem. I'm first using Fredrik
Lundh's code to extract random html gobbledygook, then running my
translate function over the file to replace the windows-1252 encoded
characters.

But for some reason, I can't seem to get my translate_code function to
work inside the same loop as Mr. Lundh's html cleanup code. Below is
the problem code:

infile = open('test.txt','rb')
outfile = open('test_cleaned.txt','wb')

for line in infile:
    try:
        newline = strip_html(line)
        cleanline = translate_code(newline)
        outfile.write(cleanline)
    except:
        newline = "NOT CLEANED: %s" % line
        outfile.write(newline)

infile.close()
outfile.close()

The strip_html function, documented here
(http://effbot.org/zone/re-sub.htm#unescape-html ), returns a text
string as far as I can tell. I'm confused why I wouldn't be able to
further manipulate the string with the "translate_code" function and
store the result in the "cleanline" variable. When I try this
approach, none of the translations succeed and I'm left with the same
HTML gook in the "outfile".

Is there some way to combine these functions so I can perform all the
processing in one pass?


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-18 Thread Kent Johnson
On Thu, Jun 18, 2009 at 4:37 PM, Serdar Tumgoren wrote:

> On the above link, the section on "Encoding Unicode Byte Streams" has
> the following example:
>
> >>> u = u"abc\u2013"
> >>> print u
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
> position 3: ordinal not in range(128)
> >>> print u.encode("utf-8")
> abc–
>
> But when I try the same example on my Windows XP machine (with Python
> 2.5.4), I can't get the same results. Instead, it spits out the below
> (hopefully it renders properly and we don't have encoding issues!!!):
>
> $ python
> Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] 
> on
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> x = u"abc\u2013"
> >>> print x
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "C:\Program Files\Python25\lib\encodings\cp437.py", line 12, in encode
>    return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in
> position 3: character maps to <undefined>
> >>> x.encode("utf-8")
> 'abc\xe2\x80\x93'
> >>> print x.encode("utf-8")
> abcΓÇô

The example is written assuming the console encoding is utf-8. Yours
seems to be cp437. Try this:
C:\Project\Mango> py
Python 2.6.1 (r261:67517, Dec  4 2008, 16:51:00) [MSC v.1500 32 bit
(Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.

In [1]: import sys

In [2]: sys.stdout.encoding
Out[2]: 'cp437'

But there is another problem - \u2013 is an en dash which does not
appear in cp437, so even giving the correct encoding doesn't work. Try
this:
In [6]: x = u"abc\u2591"

In [7]: print x.encode('cp437')
--> print(x.encode('cp437'))
abc░
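
For archive readers, the same point as a small self-contained sketch (Python 3; the variable names are illustrative only). It shows that cp437 can encode U+2591 but not U+2013, and that printing UTF-8 bytes on a cp437 console is what produces the three odd characters seen earlier in the thread.

```python
en_dash = u"abc\u2013"   # U+2013 EN DASH, not present in cp437
shade = u"abc\u2591"     # U+2591 LIGHT SHADE, present in cp437

# A character that exists in cp437 encodes to a single byte.
print(repr(shade.encode("cp437")))

# The en dash has no cp437 byte, so a strict encode fails.
try:
    en_dash.encode("cp437")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)

# UTF-8 bytes shown by a cp437 console come out as three other characters.
utf8_bytes = en_dash.encode("utf-8")
mojibake = utf8_bytes.decode("cp437")
print(ascii(mojibake))

# One way to print anyway: substitute '?' for anything the console lacks.
fallback = en_dash.encode("cp437", "replace")
print(repr(fallback))
```

The "replace" error handler loses information, so it is only suitable for display, not for data you intend to write back to a file.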


> In a related test, I was unable to change the default character encoding
> for the python interpreter from ascii to utf-8. In all cases (cygwin,
> Wing IDE, windows command line), the interpreter reported that my
> "sys" module does not contain the "setdefaultencoding" method (even
> though this should be part of the module from versions 2.x and above).

sys.setdefaultencoding() is deleted by site.py on Python startup. You have
to set the default encoding from within a sitecustomize.py module. But
it's usually better to get a correct understanding of what is going on
and to leave the default encoding alone.

Kent


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-18 Thread Serdar Tumgoren
Hey everyone,

I'm trying to get down to basics with this handy intro on Python encodings:
http://eric.themoritzfamily.com/2008/11/21/python-encodings-and-unicode/

But I'm running into some VERY strange results.

On the above link, the section on "Encoding Unicode Byte Streams" has
the following example:

>>> u = u"abc\u2013"
>>> print u
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in
position 3: ordinal not in range(128)
>>> print u.encode("utf-8")
abc–

But when I try the same example on my Windows XP machine (with Python
2.5.4), I can't get the same results. Instead, it spits out the below
(hopefully it renders properly and we don't have encoding issues!!!):

$ python
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = u"abc\u2013"
>>> print x
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python25\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position
 3: character maps to <undefined>
>>> x.encode("utf-8")
'abc\xe2\x80\x93'
>>> print x.encode("utf-8")
abcΓÇô



I get the above results in python interpreters invoked from both the
Windows command line and in a cygwin shell. HOWEVER -- the test code
works properly (i.e. I get the expected "abc–") when I run the code in
WingIDE 10.1 (version 3.1.8-1).

In a related test, I was unable to change the default character encoding
for the python interpreter from ascii to utf-8. In all cases (cygwin,
Wing IDE, windows command line), the interpreter reported that my
"sys" module does not contain the "setdefaultencoding" method (even
though this should be part of the module from versions 2.x and above).

Can anyone help me untangle this mess?

I'd be indebted!


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-18 Thread Serdar Tumgoren
> Some further searching reveals this:
> (yay archives ;))
> http://mail.python.org/pipermail/python-list/2008-April/658644.html
>

Aha! I noticed that 150 was missing from the ISO encoding table and
the source xml is indeed using windows-1252 encoding. That explains
why this appears to be the only character in the xml source that
doesn't seem to get translated by Universal Feed Parser. But I'm now
wondering if the feed parser is using windows-1252 rather than some
other encoding.

The below page provides details on how UFP handles character encodings.

http://www.feedparser.org/docs/character-encoding.html

I'm wondering if there's a way to figure out which encoding UFP uses
when it parses the file.

I didn't have the Universal Encoding Detector
(http://chardet.feedparser.org/) installed when I parsed the xml file.
It's not clear to me whether UFP requires that library to detect the
encoding or if it's an optional part of its broader routine for
determining encoding.


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-17 Thread Serdar Tumgoren
Hey everyone,
For the moment, I opted to use string replacement as my "solution."

So for the below string containing the HTML decimal representation for en dash:

>>> x = "The event takes place June 17 &#150; 19"
>>> x.replace('&#150;', '-')
'The event takes place June 17 - 19'

It works in my case since this seems to be the only code that
Universal Feed Parser didn't properly translate, but of course not an
ideal solution. I assume this path will require me to build a
character reference dictionary as I encounter more character codes.
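
A minimal sketch of what such a character reference dictionary could look like (Python 3). The names `CP1252_REFS`, `_REF` and `translate_refs` are invented for this example, and the table covers only a handful of decimal references such as &#150;; it would need extending as new codes turn up.

```python
import re

# Decimal character references (windows-1252 positions) and ASCII stand-ins.
CP1252_REFS = {
    145: u"'",   # left single quote
    146: u"'",   # right single quote
    147: u'"',   # left double quote
    148: u'"',   # right double quote
    150: u"-",   # en dash
    151: u"--",  # em dash
}

_REF = re.compile(r"&#(\d+);")


def translate_refs(text):
    """Replace known decimal references; leave unknown ones untouched."""
    def fix(match):
        code = int(match.group(1))
        return CP1252_REFS.get(code, match.group(0))
    return _REF.sub(fix, text)


print(translate_refs("The event takes place June 17 &#150; 19"))
```

A single regular expression pass with a lookup table avoids one replace() call per code and makes the list of handled characters easy to review.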

I also tried wrestling with character conversion:

>>> unichr(150)
u'\x96'

Not sure where to go from there...
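
One possible next step, sketched in Python 3 (where chr() plays the role of unichr()): code point 150 is a control character in Unicode, but byte 150 is the en dash in windows-1252, so the number has to be read as a windows-1252 byte rather than as a Unicode code point. The variable names are illustrative.

```python
import unicodedata

code = 150

# Read as a Unicode code point, 150 is a C1 control character.
as_codepoint = chr(code)
print(repr(as_codepoint), unicodedata.category(as_codepoint))

# Read as a windows-1252 byte, 150 is the en dash.
as_cp1252 = bytes([code]).decode("cp1252")
print(ascii(as_cp1252), unicodedata.name(as_cp1252))
```

This only helps if the source really was produced with windows-1252 numbering, which is an assumption about the feed rather than something the reference itself states.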


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-17 Thread Serdar Tumgoren
> Upon searching for &#150; in google, I came up with this:
> http://www.siber-sonic.com/mac/charsetstuff/Soniccharset.html

The character table definitely helps. Thanks.

Some additional googling suggests that I need to unescape HTML
entities. I'm planning to try the below approach from Fredrik Lundh.
It relies on the "re" and "htmlentitydefs" modules.

http://effbot.org/zone/re-sub.htm#unescape-html
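
For readers on Python 3, the standard library's html.unescape() covers the same ground, so a sketch of the idea is short. This is not Mr. Lundh's recipe, just a stand-in for it; the sample string is invented, and it assumes the text was escaped twice (an &amp; in front of the numeric reference), which is why two passes are shown.

```python
import html

doubly_escaped = "June 17 &amp;#150; 19"

# First pass turns &amp; back into a plain ampersand.
once = html.unescape(doubly_escaped)
print(once)

# Second pass resolves the numeric reference itself. html.unescape()
# follows the HTML5 rules, which treat 150 as a windows-1252 en dash.
twice = html.unescape(once)
print(ascii(twice))
```

If the feed is only escaped once, a single call is enough; calling it a second time on already-clean text is harmless unless the text legitimately contains ampersand sequences.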

I'll report back with my results. Meantime, I welcome any other suggestions.

Thanks!


Re: [Tutor] converting encoded symbols from rss feed?

2009-06-17 Thread Wayne
On Wed, Jun 17, 2009 at 7:30 AM, Serdar Tumgoren wrote:

> Here are some examples of the encoded characters I'm trying to
> convert:
>
>  &amp;#150; (symbol as it appears in the original xml file)
>  &#150;  (symbol as it appears in ipython shell after
> using Universal Feed Parser)
> 


I've never played around much, but &amp; is the HTML code for an &.

So if you have &amp;#150; it will show up as &#150;.

I have no clue if the latter is any type of special character or something,
though.

Upon searching for &#150; in google, I came up with this:
http://www.siber-sonic.com/mac/charsetstuff/Soniccharset.html

HTH,
Wayne