Martin Schmidt, 15.03.2010 18:16:
I have just started to use Python a few weeks ago and until last week I had
no knowledge of XML.
Obviously my programming knowledge is pretty basic.
Now I would like to use Python in combination with ca. 2000 XML documents
(about 30 kb each) to search for certain regular expression within specific
elements of these documents.

2000 * 30K isn't a huge problem, that's just 60M in total. If you just have to do it once, drop your performance concerns and just get a solution going. If you have to do it once a day, take care to use a tool that is not too resource consuming. If you have strict requirements to do it once a minute, use a fast machine with a couple of cores and do it in parallel. If you have a huge request workload and want to reverse index the XML to do all sorts of sophisticated queries on it, use a database instead.


I would then like to record the number of occurrences of the regular
expression within these elements.
Moreover I would like to count the total number of words contained within
these,

len(text.split()) will give you those.

BTW, is it document-style XML (with mixed content as in HTML) or is the text always withing a leaf element?


and record the attribute of a higher level element that contains
them.

An example would certainly help here.


I was trying to figure out the best way how to do this, but got overwhelmed
by the available information (e.g. posts using different approaches based on
dom, sax, xpath, elementtree, expat).
The outcome should be a file that lists the extracted attribute, the number
of occurrences of the regular expression, and the total number of words.
I did not find a post that addresses my problem.

Funny that you say that after stating that you were overwhelmed by the available information.


If someone could help me with this I would really appreciate it.

Most likely, the solution with the best simplicity/performance trade-off would be xml.etree.cElementTree's iterparse(), intercept on each interesting tag name, and search its text/tail using the regexp. That's doable in a couple of lines.

But unless you provide more information, it's hard to give better advice.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to