Horst Gutmann wrote:

> I currently have quite a big problem with minidom and special chars (for
> example ü) in HTML.
>
> Let's say I have the following input file:
> --------------------------------------------------
> <?xml version="1.0"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
>             "http://www.w3.org/TR/html4/strict.dtd">
> <html>
> <body>
> &uuml;
> </body>
> </html>
> --------------------------------------------------

> test3.html only has a blank line where the &uuml; should be. It is simply
> removed.
>
> Any idea how I could solve this problem?

umm.  doesn't that doctype point to an SGML DTD?  even if minidom did fetch
external DTDs (I don't think it does), it would probably choke on that DTD.

running your documents through "tidy -asxml -numeric" before parsing them as
XML might be a good idea...

    http://tidy.sourceforge.net/ (command-line binaries, library)
    http://utidylib.berlios.de/ (python bindings)
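
if installing tidy isn't an option, the effect of "-numeric" on named
entities can be approximated in plain Python.  this is only a rough sketch,
not what tidy does internally: the helper name numeric_entities is made up
for this example, it assumes the html.entities table from the Python 3
standard library, and it does none of tidy's other cleanup (unclosed tags,
bad nesting, and so on).

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# Entities that XML already predefines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def numeric_entities(text):
    """Replace named HTML entities such as &uuml; with numeric references.

    Hypothetical helper for illustration; unknown names are left as-is.
    """
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

# A cut-down version of the input file, without the SGML doctype.
source = "<html><body>&uuml;</body></html>"

# After the rewrite the document contains only numeric references,
# which an XML parser can resolve without any DTD.
doc = minidom.parseString(numeric_entities(source))
body_text = doc.getElementsByTagName("body")[0].firstChild.data
```

with the entity rewritten as a numeric reference, the text node inside
<body> should hold the ü character instead of being dropped.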

</F> 



