Re: extract occurrence of regular expression from elements of XML documents

Stefan Behnel Tue, 16 Mar 2010 00:59:09 -0700

Martin Schmidt, 15.03.2010 18:16:

I have just started to use Python a few weeks ago and until last week I had
no knowledge of XML.
Obviously my programming knowledge is pretty basic.
Now I would like to use Python in combination with ca. 2000 XML documents
(about 30 kb each) to search for certain regular expression within specific
elements of these documents.

2000 * 30K isn't a huge problem, that's just 60M in total. If you just have to do it once, drop your performance concerns and just get a solution going. If you have to do it once a day, take care to use a tool that is not too resource consuming. If you have strict requirements to do it once a minute, use a fast machine with a couple of cores and do it in parallel. If you have a huge request workload and want to reverse index the XML to do all sorts of sophisticated queries on it, use a database instead.

I would then like to record the number of occurrences of the regular
expression within these elements.
Moreover I would like to count the total number of words contained within
these,


len(text.split()) will give you those.
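
As a minimal sketch of that word count (the sample string is made up for illustration): str.split() with no arguments splits on any run of whitespace and ignores leading and trailing whitespace, so extra spaces, tabs and newlines do not inflate the count.

```python
# Made-up sample text with irregular whitespace.
text = "  The quick\tbrown fox\njumps  "

# split() without arguments splits on runs of whitespace and drops
# empty strings, so the length of the result is the word count.
word_count = len(text.split())
print(word_count)
```

Note that this counts whitespace-separated tokens, so punctuation attached to a word is counted as part of that word.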

BTW, is it document-style XML (with mixed content as in HTML) or is the text always within a leaf element?

and record the attribute of a higher level element that contains
them.


An example would certainly help here.

I was trying to figure out the best way how to do this, but got overwhelmed
by the available information (e.g. posts using different approaches based on
dom, sax, xpath, elementtree, expat).
The outcome should be a file that lists the extracted attribute, the number
of occurrences of the regular expression, and the total number of words.
I did not find a post that addresses my problem.

Funny that you say that after stating that you were overwhelmed by the available information.

If someone could help me with this I would really appreciate it.

Most likely, the solution with the best simplicity/performance trade-off would be xml.etree.cElementTree's iterparse(), intercept on each interesting tag name, and search its text/tail using the regexp. That's doable in a couple of lines.
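
A rough sketch of what that could look like, under several assumptions: the element names "section" and "para", the attribute "id", the regexp and the sample document are all invented here, since the real document structure is not known. It uses xml.etree.ElementTree, because on current Python 3 that module picks up the C accelerator automatically and the separate cElementTree module was removed in Python 3.9. The text of each interesting element is collected with itertext(), which also covers mixed content.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag and attribute names are assumptions.
SAMPLE = b"""<doc>
  <section id="s1"><para>foo bar foo</para><para>no match here</para></section>
  <section id="s2"><para>one <b>foo</b> more</para></section>
</doc>"""


def scan(source, outer_tag, attr_name, inner_tag, pattern):
    """Return (outer attribute, regexp match count, word count) per inner element."""
    regex = re.compile(pattern)
    rows = []
    current_attr = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == outer_tag:
            # Attributes are already available on the "start" event.
            current_attr = elem.get(attr_name)
        elif event == "end" and elem.tag == inner_tag:
            # itertext() yields the element's text plus all nested text and tails.
            text = "".join(elem.itertext())
            matches = sum(1 for _ in regex.finditer(text))
            rows.append((current_attr, matches, len(text.split())))
        elif event == "end" and elem.tag == outer_tag:
            # Drop the processed subtree to keep memory use low.
            elem.clear()
            current_attr = None
    return rows


rows = scan(io.BytesIO(SAMPLE), "section", "id", "para", r"foo")
for attr, matches, words in rows:
    print(attr, matches, words)
```

The sketch does not handle nested occurrences of the outer element, and iterparse() also accepts a filename in place of the BytesIO object, so looping over the 2000 files and writing each row to an output file is a small extension.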


But unless you provide more information, it's hard to give better advice.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
