Hi Lucas,
On Sat, Oct 01, 2016 at 10:45:00AM +0200, Lucas Nussbaum wrote:
> During a rebuild of all packages in sid, your package failed to build
> on amd64. [...]
>> expected testauto/output.html to contain 'expand(char(&#248;))'
Thanks for reporting this. After a bit of digging, I've found that it's
caused by a non-backwards-compatible change in libtidy (which rawdog
uses via the python-libtidy bindings): in libtidy 0.99, the input and
output encodings defaulted to ASCII, whereas libtidy 5 defaults them to
UTF-8. The result is that libtidy takes the HTML that rawdog has already
converted to ASCII, and expands the character references into UTF-8
characters.
On jessie, with libtidy-0.99.0 20091223cvs-1.4+deb8u1:
$ python -c 'import tidylib; print repr(tidylib.tidy_document("&#200;",
{"numeric_entities": 1, "output_html": 1})[0])'
'<html>\n <head>\n <title></title>\n </head>\n <body>\n &#200;\n
</body>\n</html>\n'
On sid, with libtidy5 5.2.0-2:
$ python -c 'import tidylib; print repr(tidylib.tidy_document("&#200;",
{"numeric_entities": 1, "output_html": 1})[0])'
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<html>\n <head>\n
<title></title>\n </head>\n <body>\n \xc3\x88\n </body>\n</html>\n'
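To make the byte-level difference concrete: the \xc3\x88 in the sid output is the UTF-8 encoding of U+00C8, the character that a numeric reference such as &#200; stands for. The following is only an illustrative Python 3 sketch using the standard library. It does not call libtidy, the two function names are made up for this example, and html.unescape also expands references (such as &lt;) that tidy would leave alone.

```python
import html


def expand_references_to_utf8(ascii_html):
    """Roughly what the UTF-8 default does: references become raw UTF-8 bytes."""
    return html.unescape(ascii_html).encode("utf-8")


def to_ascii_with_numeric_references(text):
    """Roughly what ASCII output with numeric entities gives: &#NNN; for non-ASCII."""
    return text.encode("ascii", "xmlcharrefreplace")


expanded = expand_references_to_utf8("&#200;")
print(repr(expanded))  # b'\xc3\x88'

restored = to_ascii_with_numeric_references(expanded.decode("utf-8"))
print(repr(restored))  # b'&#200;'
```

The second call shows why forcing ASCII output keeps the document pure ASCII: any character outside that range has to be written as a numeric reference again.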
Specifying the input and output encodings explicitly as ASCII fixes
this. I've made the following change in upstream rawdog, and it'll be
fixed in rawdog 2.22:
diff --git a/rawdoglib/rawdog.py b/rawdoglib/rawdog.py
index d1d4e4c..8a6702a 100644
--- a/rawdoglib/rawdog.py
+++ b/rawdoglib/rawdog.py
@@ -136,6 +136,8 @@ def sanitise_html(html, baseurl, inline, config):
     if config["tidyhtml"]:
         args = {
             "numeric_entities": 1,
+            "input_encoding": "ascii",
+            "output_encoding": "ascii",
             "output_html": 1,
             "output_xhtml": 0,
             "output_xml": 0,
Cheers,
--
Adam Sampson <[email protected]> <http://offog.org/>