Status: New
Owner: ----
New issue 98 by nikolay.panov: Encoding issue: 'ascii' codec instead of
appropriate one.
http://code.google.com/p/html5lib/issues/detail?id=98
This issue is related to the following sentence in the docs: "If no
encoding can be found and the chardet library is available, an attempt will
be made to sniff the encoding from the byte pattern."
* What steps will reproduce the problem?
>>> html=fetch_url('http://www.ixbt.com/news/soft/index.shtml?11/72/39')
>>> p = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("beautifulsoup"))
>>> soup = p.parse(html)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 177, in parse
    self._parse(stream, innerHTML=False, encoding=encoding)
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 93, in _parse
    self.mainLoop()
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 149, in mainLoop
    self.phase.processStartTag(token["name"], token["data"])
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 314, in processStartTag
    self.startTagHandler[name](name, attributes)
  File "/home/niksite/lib/site-python/html5lib/html5parser.py", line 605, in startTagMeta
    data = inputstream.EncodingBytes(attributes["content"])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 12-18: ordinal not in range(128)
>>> chardet.detect(html)
{'confidence': 0.94890270449856784, 'encoding': 'windows-1251'}
* What is the expected output? What do you see instead?
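I expected the document to be parsed without errors. Instead, the parser
raises the UnicodeEncodeError shown above. As we can see, chardet
successfully detects the 'windows-1251' encoding of the HTML document
provided.
Why does html5lib try to use the 'ascii' codec?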
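
A possible explanation, offered as a guess rather than a confirmed diagnosis: the traceback ends in `inputstream.EncodingBytes(attributes["content"])`, which suggests that the `content` attribute of a meta tag is already a text string containing non-ASCII characters and is being coerced to bytes without an explicit encoding. The sketch below reproduces only that class of error with the standard library. The sample string and the helper `to_ascii_bytes` are hypothetical stand-ins and are not part of html5lib.

```python
# Hypothetical sample standing in for a meta tag's content attribute,
# stored as windows-1251 bytes (the encoding chardet reported above).
raw = "text/html; charset=windows-1251 новости".encode("windows-1251")

# Decoding with the detected encoding yields proper text.
text = raw.decode("windows-1251")


def to_ascii_bytes(value):
    """Mimic a text-to-bytes coercion that falls back to the 'ascii' codec."""
    return value.encode("ascii")


try:
    to_ascii_bytes(text)
    error = None
except UnicodeEncodeError as exc:
    # Same error class as in the traceback: Cyrillic letters have no
    # ASCII representation, so the 'ascii' codec refuses to encode them.
    error = exc

print(error)
```

If this guess is right, the 'ascii' codec would come from the text-to-bytes coercion inside `startTagMeta`, not from chardet or from the sniffing step described in the docs.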