Thanks, Stefan. Actually I will have to run the searches I am interested in only a few times and therefore will drop performance concerns.
Thanks for len(text.split()). I will try it later.

The text I am interested in is always in leaf elements. I posted a concrete example, including a representative XML file, a few minutes ago. I hope this clarifies my problem.

Rereading what I wrote, it admittedly sounds funny. What I meant was that I did not find a post that closely matches my problem (I know that the closeness needed in my case will seem excessive to more experienced Python/XML users).

Best regards,
Martin

P.S. Sorry for my late reply, but my Internet connection was down for a day.

> ---------- Forwarded message ----------
> From: Stefan Behnel <stefan...@behnel.de>
> To: python-list@python.org
> Date: Tue, 16 Mar 2010 08:50:30 +0100
> Subject: Re: extract occurrence of regular expression from elements of XML documents
>
> Martin Schmidt, 15.03.2010 18:16:
>
>> I have just started to use Python a few weeks ago and until last week I had
>> no knowledge of XML.
>> Obviously my programming knowledge is pretty basic.
>> Now I would like to use Python in combination with ca. 2000 XML documents
>> (about 30 kb each) to search for certain regular expression within specific
>> elements of these documents.
>
> 2000 * 30K isn't a huge problem, that's just 60M in total. If you just have
> to do it once, drop your performance concerns and just get a solution going.
> If you have to do it once a day, take care to use a tool that is not too
> resource consuming. If you have strict requirements to do it once a minute,
> use a fast machine with a couple of cores and do it in parallel. If you have
> a huge request workload and want to reverse index the XML to do all sorts of
> sophisticated queries on it, use a database instead.
>
>> I would then like to record the number of occurrences of the regular
>> expression within these elements.
>> Moreover I would like to count the total number of words contained within
>> these,
>
> len(text.split()) will give you those.
>
> BTW, is it document-style XML (with mixed content as in HTML) or is the
> text always within a leaf element?
>
>> and record the attribute of a higher level element that contains
>> them.
>
> An example would certainly help here.
>
>> I was trying to figure out the best way how to do this, but got overwhelmed
>> by the available information (e.g. posts using different approaches based on
>> dom, sax, xpath, elementtree, expat).
>> The outcome should be a file that lists the extracted attribute, the number
>> of occurrences of the regular expression, and the total number of words.
>> I did not find a post that addresses my problem.
>
> Funny that you say that after stating that you were overwhelmed by the
> available information.
>
>> If someone could help me with this I would really appreciate it.
>
> Most likely, the solution with the best simplicity/performance trade-off
> would be xml.etree.cElementTree's iterparse(), intercept on each interesting
> tag name, and search its text/tail using the regexp. That's doable in a
> couple of lines.
>
> But unless you provide more information, it's hard to give better advice.
>
> Stefan
>
>
> ---------- Forwarded message ----------
> From: Chris Rebert <c...@rebertia.com>
> To: "Lawrence D'Oliveiro" <l...@geek-central.gen.nz>
> Date: Tue, 16 Mar 2010 00:52:07 -0700
> Subject: Re: import antigravity
>
> On Tue, Mar 16, 2010 at 12:40 AM, Lawrence D'Oliveiro
> <l...@geek-central.gen.new_zealand> wrote:
>> Subtle...
>
> You're a bit behind the times.
> If my calculations are right, that comic is over 2 years old.
>
> Cheers,
> Chris
>
>
> ---------- Forwarded message ----------
> From: Stefan Behnel <stefan...@behnel.de>
> To: python-list@python.org
> Date: Tue, 16 Mar 2010 08:51:58 +0100
> Subject: Re: import antigravity
>
> Lawrence D'Oliveiro, 16.03.2010 08:40:
>
>> Subtle...
>
> Absolutely.
>
> Python 2.4.6 (#2, Jan 21 2010, 23:45:25)
> [GCC 4.4.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import antigravity
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> ImportError: No module named antigravity
>
> Stefan
>
> --
> http://mail.python.org/mailman/listinfo/python-list
-- http://mail.python.org/mailman/listinfo/python-list
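
For reference, the iterparse() approach Stefan outlines in the quoted message could look roughly like the sketch below. It is only an illustration under assumptions: the tag names "section" and "para", the attribute "name", the regular expression, and the sample document are all made up, since the thread does not show the real XML. It also assumes the text sits in leaf elements and that the higher-level elements are not nested inside each other. On current Python versions the C accelerator is used automatically by xml.etree.ElementTree, so the sketch imports that module rather than cElementTree.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical names: adjust these to the real documents.
PARENT_TAG = "section"   # higher-level element whose attribute is recorded
PARENT_ATTR = "name"     # attribute to extract from that element
LEAF_TAG = "para"        # leaf element that holds the text
PATTERN = re.compile(r"\bprofit\w*", re.IGNORECASE)  # example regexp only


def scan(source):
    """Return one (attribute, regex matches, word count) tuple per parent element."""
    rows = []
    matches = words = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == PARENT_TAG:
            # A new parent element begins: reset the running counters.
            matches = words = 0
        elif event == "end" and elem.tag == LEAF_TAG:
            # The leaf is complete, so its text is fully available now.
            text = elem.text or ""
            matches += len(PATTERN.findall(text))
            words += len(text.split())
        elif event == "end" and elem.tag == PARENT_TAG:
            rows.append((elem.get(PARENT_ATTR), matches, words))
            elem.clear()  # free the finished subtree
    return rows


# Made-up sample document, standing in for one of the real files.
SAMPLE = b"""<doc>
  <section name="intro"><para>Profit rose. Profits fell.</para></section>
  <section name="outlook"><para>No change expected</para><para>profitable year</para></section>
</doc>"""

rows = scan(io.BytesIO(SAMPLE))
for attr, n_matches, n_words in rows:
    print(attr, n_matches, n_words, sep="\t")
```

scan() also accepts a filename, so running it over the roughly 2000 files would be a plain loop that appends each file's rows to one output file. With only a few runs planned, nothing more elaborate seems necessary.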