Status: New
Owner: ----

New issue 102 by pipoket88: [Patch] Korean encoding problem patch
http://code.google.com/p/html5lib/issues/detail?id=102

Hello.
I recently used html5lib to parse *massive* webpages for some
information gathering.
However, html5lib raised LookupError exceptions for two encodings,
one of which is a Korean encoding.
So I did a simple check. The following piece of code finds
which encodings cause this LookupError problem.
import html5lib

# Collect every distinct encoding name that html5lib maps to.
encoding_set = set(html5lib.constants.encodings.values())
error_encoding_set = set()

a = "TestText"
for encoding in encoding_set:
    try:
        a.encode(encoding)
    except LookupError:
        # Python has no codec registered under this name.
        error_encoding_set.add(encoding)

print(error_encoding_set)

Running the code above gives the following result.
pipo...@dev:~$ python err_chk.py
set(['windows-874', 'windows-949'])

The encodings that caused the problem were ‘windows-949’ and ‘windows-874’.
As far as I know, these encoding names may have been used in the cjkcodecs
package before Python 2.3.
But these encoding names are now deprecated, so we have to use ‘cp949’
and ‘cp874’ instead.
Changing those encodings to ‘cp949’ and ‘cp874’ resolved the problem.
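
For illustration, here is a minimal sketch of the same renaming done as a lookup step instead of by editing the encoding table. It is not the attached patch; the names PYTHON_CODEC_ALIASES and resolve_codec are hypothetical and exist only for this example. It assumes that ‘cp949’ and ‘cp874’ are available in the Python build in use, and it relies only on the standard codecs.lookup() call.

```python
import codecs

# Hypothetical alias table (not part of html5lib): maps the two names that
# failed in the check above to the codec names suggested in this report.
PYTHON_CODEC_ALIASES = {
    "windows-949": "cp949",
    "windows-874": "cp874",
}


def resolve_codec(name):
    """Look up a codec, translating the known problem names first.

    Raises LookupError, like codecs.lookup(), if the name is still unknown.
    """
    python_name = PYTHON_CODEC_ALIASES.get(name.lower(), name)
    return codecs.lookup(python_name)


if __name__ == "__main__":
    for label in ("windows-949", "windows-874"):
        print(label, "->", resolve_codec(label).name)
```

A wrapper like this leaves the original labels untouched and only translates them at the point where a Python codec is needed.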

I have also attached a patch that can be applied to the html5lib source.
Thanks.

Woosuk Suh


Attachments:
        html5lib_charset_18c3325ee58c.patch  1.9 KB

