Re: xhtml encoding question

Tim Arnold Wed, 01 Feb 2012 21:13:11 -0800

On 2/1/2012 3:26 AM, Stefan Behnel wrote:

Tim Arnold, 31.01.2012 19:09:

I have to follow a specification for producing xhtml files.
The original files are in cp1252 encoding and I must reencode them to utf-8.
Also, I have to replace certain characters with html entities.
---------------------------------
import codecs, StringIO
from lxml import etree
high_chars = {
    0x2014:'&mdash;', # 'EM DASH',
    0x2013:'&ndash;', # 'EN DASH',
    0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
    0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
    0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
    0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
    0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
    0x2122:'&trade;', # 'TRADE MARK SIGN',
    0x00A9:'&copy;',  # 'COPYRIGHT SIGN',
    }
def translate(string):
    s = ''
    for c in string:
        if ord(c) in high_chars:
            c = high_chars.get(ord(c))
        s += c
    return s


I hope you are aware that this is about the slowest possible algorithm
(well, the slowest one that doesn't do anything unnecessary). Since none of
this is required when parsing or generating XHTML, I assume your spec tells
you that you should do these replacements?

I wasn't aware of it, but I am now. The code's embarrassing now.
The spec I must follow forces me to do the translation.

I am actually working with HTML, not XHTML, which makes a huge difference; sorry for that.


Ulrich's line of code for translate is elegant (the keys of high_chars are code points, so the lookup goes through ord()):
for c in string:
    s += high_chars.get(ord(c), c)
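
A further option for the speed concern raised above, offered as a minimal sketch rather than anything proposed in the thread: the built-in translate() method of text strings accepts a dict keyed by code point, which is exactly the shape of high_chars. The sketch assumes Python 3 (where every str is Unicode text) and uses a trimmed copy of the table named HIGH_CHARS purely for illustration.

```python
# Trimmed copy of the mapping from the post: code point -> HTML entity.
HIGH_CHARS = {
    0x2014: '&mdash;',  # EM DASH
    0x2019: '&rsquo;',  # RIGHT SINGLE QUOTATION MARK
    0x00A9: '&copy;',   # COPYRIGHT SIGN
}


def translate(text):
    """Replace every mapped character with its entity in a single pass."""
    # str.translate looks each character up by its ordinal; characters
    # that are not in the table pass through unchanged.
    return text.translate(HIGH_CHARS)


print(translate('It\u2019s here \u2014 \u00a9 2012'))
```

This avoids building the result one character at a time in a Python-level loop, and the replacement values may be longer than one character.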

def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
    with codecs.open(filename,encoding=in_encoding) as f:
        s = f.read()
    sio = StringIO.StringIO(translate(s))
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(sio, parser)


Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice. Then, didn't you say XHTML? Why do you use the
HTML parser to parse XML?

I see that I'm decoding twice now, thanks.

Also, I now see that when lxml writes the result back out, the entities I got from my translate function are resolved, which defeats the whole purpose.

    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename,'wb') as f:
        f.write(result)


Use tree.write(f, ...)

From all the info I've received on this thread, plus some additional reading, I think I need the following code.

Use the HTMLParser because the source files are actually HTML, and use the output from etree.tostring() as input to translate() as the very last step.


def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(filename, parser)
    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename, 'wb') as f:
        f.write(translate(result))

I don't simply use tree.write(f, ...) because I have to do the translation at the end, so I get the entities instead of the resolved characters from lxml.

Again, it would be simpler if this were XHTML, but I misspoke (mis-wrote?) when I said XHTML; this is for HTML.
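
One thing to check in the revised reencode() above: etree.tostring() with encoding='utf-8' returns an encoded byte string, so translate() would be looking at individual bytes rather than characters. A code point such as 0x2014 can never match a single byte, and 0xA9 would match the second byte of the UTF-8 form of the copyright sign. The sketch below, assuming Python 3, shows the ordering that avoids this: translate the text first, then encode once. The helper name to_output_bytes and the sample markup are hypothetical; with lxml the text could come from etree.tostring(root, method='html', encoding='unicode').

```python
HIGH_CHARS = {0x2014: '&mdash;', 0x00A9: '&copy;'}


def translate(text):
    """Replace mapped characters with entities (text in, text out)."""
    return text.translate(HIGH_CHARS)


def to_output_bytes(markup, out_encoding='utf-8'):
    """Translate while the markup is still text, then encode exactly once."""
    return translate(markup).encode(out_encoding)


# Stand-in for markup serialized as text by the parser library.
markup = '<p>\u00a9 2012 \u2014 done</p>'

# The pitfall: after encoding, the copyright sign is two bytes, so a
# per-byte lookup would see 0xC2 and 0xA9 separately.
encoded_first = markup.encode('utf-8')
print(list(encoded_first[3:5]))

# The safe order: entities are inserted before any encoding happens.
print(to_output_bytes(markup))
```

The resulting bytes are pure ASCII here, so they are valid in UTF-8 and could be written to a file opened in 'wb' mode as in the function above.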

Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:

   tree = etree.parse(in_path)
   tree.write(out_path, encoding='utf8', pretty_print=True)

Note that I didn't provide an input encoding. XML is safe in that regard.

Stefan
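
To illustrate the point about XML not needing an explicit input encoding, here is a small sketch, assuming Python 3. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the sample document is made up. The parser reads the encoding from the XML declaration, so the caller never names it.

```python
import xml.etree.ElementTree as ET

# Bytes as they might sit on disk: byte 0x97 is EM DASH in windows-1252,
# and the XML declaration says which encoding the file uses.
source = b'<?xml version="1.0" encoding="windows-1252"?><p>a\x97b</p>'

# No encoding argument: the parser picks it up from the declaration.
root = ET.fromstring(source)
print(root.text == 'a\u2014b')

# Serializing to UTF-8 re-encodes the character; nothing is decoded twice.
out = ET.tostring(root, encoding='utf-8')
print(b'a\xe2\x80\x94b' in out)
```

HTML files carry no such guaranteed declaration, which is why the HTMLParser in the code above is given the input encoding explicitly.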


thanks everyone for the help.

--Tim Arnold

--
http://mail.python.org/mailman/listinfo/python-list
