Hello,

I recently ran into a problem with lxml (the Python binding for libxml2). The problem most likely comes from libxml2 itself. I hope you can help me.
I'm working on a university project where I use Python and lxml for XML parsing and processing. At first there were no namespace declarations in the XML files we used, but the format has recently changed slightly and some namespace declarations were added.

Here is the XML format as it was in the beginning:

    <?xml version="1.0" encoding="UTF-8"?>
    <D-Spin>
      <MetaData>
        <source>IMS, Uni Stuttgart</source>
      </MetaData>
      <TextCorpus lang="de">
        <text>European Medicines Agency EMEA/H/C/471 [...] Wegen</text>
        <tokens>
          <token ID="t1">European</token>
          <token ID="t2">Medicines</token>
          [...]
          <token ID="t145906">Wegen</token>
        </tokens>
        <sentences>
          [...]
          <sentence ID="s5921" tokenIDs="t145906"/>
        </sentences>
      </TextCorpus>
    </D-Spin>

And here is the XML with the recently added namespace declarations:

    <?xml version="1.0" encoding="UTF-8"?>
    <D-Spin xmlns="http://www.dspin.de/data" version="0.4">
      <MetaData xmlns="http://www.dspin.de/data/metadata">
        <source>IMS, Uni Stuttgart</source>
      </MetaData>
      <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
        <text>European Medicines Agency EMEA/H/C/471 [...] Wegen</text>
        <tokens>
          <token ID="t1">European</token>
          <token ID="t2">Medicines</token>
          [...]
          <token ID="t145906">Wegen</token>
        </tokens>
        <sentences>
          [...]
          <sentence ID="s5921" tokenIDs="t145906"/>
        </sentences>
      </TextCorpus>
    </D-Spin>

I wanted to extract the content of all the <token> elements. With the XML file without namespace declarations, that takes just a moment (less than 30 seconds). But when I tried to do the same on the new file with namespaces, it took much longer: more than 30 minutes (!). The XML file was about 7 MB.

Since the same problem occurs when the XML file is parsed with the libxml2 binding for Perl, I guess the problem comes from libxml2 itself.

It is also strange that the slowdown is not proportional to the number of <token> tags to be parsed. The first 10 000 tags only need about a second.
But when we parse the first 20 000 tags, it takes 21 seconds!

Do you have any idea about the cause of this problem and how it could be solved?

Thanks,
Max
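
For reference, here is a minimal sketch of namespace-aware <token> extraction. It is not the code used in the project, which is not shown in this post. It assumes streaming with iterparse and matching on the fully qualified tag name, with the namespace URI taken from the <TextCorpus> element in the sample above. The names TC_NS, TOKEN_TAG, iter_token_texts and SAMPLE are made up for this illustration, and SAMPLE is a shortened stand-in for the real 7 MB file. Whether this approach avoids the slowdown depends on where the time is actually being spent.

```python
import io

try:
    from lxml import etree  # the binding discussed in this post
except ImportError:
    # Standard-library fallback; it offers the same calls used below.
    import xml.etree.ElementTree as etree

# Namespace URI copied from the <TextCorpus> element in the sample file.
TC_NS = "http://www.dspin.de/data/textcorpus"
# ElementTree-style qualified name: "{namespace-uri}localname".
TOKEN_TAG = "{%s}token" % TC_NS


def iter_token_texts(source):
    """Yield the text of every <token> element while streaming the input."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == TOKEN_TAG:
            yield elem.text
            # Drop the element's content once it has been read.
            elem.clear()


# Hypothetical, shortened stand-in for the real document.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <tokens>
      <token ID="t1">European</token>
      <token ID="t2">Medicines</token>
    </tokens>
  </TextCorpus>
</D-Spin>
"""

tokens = list(iter_token_texts(io.BytesIO(SAMPLE)))
print(tokens)
```

Because the <token> elements inherit the default namespace declared on <TextCorpus>, a lookup for a plain "token" tag would match nothing in the new format; the qualified name is needed.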
