On 2/1/2012 3:26 AM, Stefan Behnel wrote:
Tim Arnold, 31.01.2012 19:09:
I have to follow a specification for producing xhtml files.
The original files are in cp1252 encoding and I must reencode them to utf-8.
Also, I have to replace certain characters with html entities.
---------------------------------
import codecs, StringIO
from lxml import etree
high_chars = {
    0x2014:'&mdash;', # 'EM DASH',
    0x2013:'&ndash;', # 'EN DASH',
    0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
    0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
    0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
    0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
    0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
    0x2122:'&trade;', # 'TRADE MARK SIGN',
    0x00A9:'&copy;',  # 'COPYRIGHT SIGN',
    }
def translate(string):
    s = ''
    for c in string:
        if ord(c) in high_chars:
            c = high_chars.get(ord(c))
        s += c
    return s

I hope you are aware that this is about the slowest possible algorithm
(well, the slowest one that doesn't do anything unnecessary). Since none of
this is required when parsing or generating XHTML, I assume your spec tells
you that you should do these replacements?

I wasn't aware of it, but I am now--the code's embarrassing.
The spec I must follow forces me to do the translation.

I am actually working with HTML, not XHTML, which makes a huge difference; sorry for that.

Ulrich's line of code for translate is elegant.
for c in string:
    s += high_chars.get(ord(c), c)  # keys are code points, so look up ord(c)
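
The loop can probably be dropped altogether in favour of a table lookup. What follows is a minimal sketch, not a definitive implementation. It assumes Python 3, where str.translate accepts a dict that maps code points to replacement strings (in Python 2 the same call works on unicode objects), and it assumes the replacements are the named entities the spec asks for. The names ENTITY_TABLE and to_entities are made up for this example.

```python
# Hypothetical table: code point -> named HTML entity (same keys as high_chars).
ENTITY_TABLE = {
    0x2014: '&mdash;',   # EM DASH
    0x2013: '&ndash;',   # EN DASH
    0x0160: '&Scaron;',  # LATIN CAPITAL LETTER S WITH CARON
    0x201d: '&rdquo;',   # RIGHT DOUBLE QUOTATION MARK
    0x201c: '&ldquo;',   # LEFT DOUBLE QUOTATION MARK
    0x2019: '&rsquo;',   # RIGHT SINGLE QUOTATION MARK
    0x2018: '&lsquo;',   # LEFT SINGLE QUOTATION MARK
    0x2122: '&trade;',   # TRADE MARK SIGN
    0x00A9: '&copy;',    # COPYRIGHT SIGN
}


def to_entities(text):
    """Replace every mapped character with its entity in a single pass."""
    return text.translate(ENTITY_TABLE)


print(to_entities('A \u2014 B \u00a9 2012'))
```

Characters that are not in the table pass through unchanged, so no explicit default is needed.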


def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
    with codecs.open(filename,encoding=in_encoding) as f:
        s = f.read()
    sio = StringIO.StringIO(translate(s))
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(sio, parser)

Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice. Then, didn't you say XHTML? Why do you use the
HTML parser to parse XML?

I see that I'm decoding twice now, thanks.

Also, I now see that when lxml writes the result back out, the entities I got from my translate function are resolved, which defeats the whole purpose.
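
A small illustration of why entities inserted before parsing do not survive: an HTML parser turns &mdash; back into the character U+2014, and a UTF-8 serializer then writes that character as raw bytes. This sketch uses the standard library's html.unescape (Python 3) as a stand-in for the parsing step. That is an assumption of the example and not a description of what lxml does internally.

```python
import html

source = '<p>dash &mdash; here</p>'

# "Parsing": the named entity is resolved to the real character.
parsed_text = html.unescape(source)

# "Serializing": UTF-8 output carries the character as three raw bytes,
# so the literal text '&mdash;' is gone from the result.
serialized = parsed_text.encode('utf-8')

print(repr(parsed_text))
print(serialized)
```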

    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
    with open(filename,'wb') as f:
        f.write(result)

Use tree.write(f, ...)

From all the info I've received on this thread, plus some additional reading, I think I need the following code.

Use the HTMLParser because the source files are actually HTML, and use output from etree.tostring() as input to translate() as the very last step.

def reencode(filename, in_encoding='cp1252', out_encoding='utf-8'):
    parser = etree.HTMLParser(encoding=in_encoding)
    tree = etree.parse(filename, parser)
    result = etree.tostring(tree.getroot(), method='html',
                            pretty_print=True,
                            encoding=out_encoding)
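    with open(filename, 'wb') as f:
        # tostring() returned encoded bytes; translate() matches code points,
        # so decode first, substitute the entities, then encode again.
        f.write(translate(result.decode(out_encoding)).encode(out_encoding))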

I'm not simply using tree.write(f, ...) because I have to do the translation at the end, so that I get the entities instead of the resolved characters from lxml.
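
Whichever way the last step is written, the substitution has to see text, not encoded bytes, because UTF-8 bytes are not code points. Here is a sketch under stated assumptions: Python 3, only the copyright entry of the table, and two hypothetical helpers named textwise and naive_bytewise. It shows that byte 0xA9 also occurs inside the UTF-8 encoding of U+00E9 (e with acute), so a byte-by-byte scan corrupts that character.

```python
COPY = {0x00A9: '&copy;'}  # code point -> entity, one entry for brevity


def textwise(data, encoding='utf-8'):
    """Decode, substitute on code points, then encode again."""
    return data.decode(encoding).translate(COPY).encode(encoding)


def naive_bytewise(data):
    """Substitute on raw byte values, as a scan over a byte string would."""
    return b''.join(
        COPY[b].encode('ascii') if b in COPY else bytes([b]) for b in data
    )


encoded = 'caf\u00e9 \u00a9'.encode('utf-8')  # b'caf\xc3\xa9 \xc2\xa9'

print(naive_bytewise(encoded))  # the 0xA9 byte inside the accented e is hit too
print(textwise(encoded))        # only the real copyright sign is replaced
```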

Again, it would be simpler if this was xhtml, but I misspoke (mis-wrote?) when I said xhtml; this is for html.

Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:

   tree = etree.parse(in_path)
   tree.write(out_path, encoding='utf8', pretty_print=True)

Note that I didn't provide an input encoding. XML is safe in that regard.
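
To illustrate the point about XML being safe: the XML declaration names the encoding, and a document without one is taken to be UTF-8 (or UTF-16 with a byte order mark), so the parser needs no hint from the caller. The sketch below uses the standard library's xml.etree.ElementTree in Python 3 rather than lxml. That swap is an assumption made to keep the example self-contained, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# The declaration states the encoding; byte 0x97 is an em dash in windows-1252.
raw = b'<?xml version="1.0" encoding="windows-1252"?><p>a \x97 b</p>'

# No encoding argument is passed: the parser reads it from the declaration.
root = ET.fromstring(raw)

# Re-serialize as UTF-8; the em dash becomes the bytes E2 80 94.
out = ET.tostring(root, encoding='utf-8')

print(repr(root.text))
print(out)
```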

Stefan


thanks everyone for the help.

--Tim Arnold

--
http://mail.python.org/mailman/listinfo/python-list
