On Tue, Mar 16, 2010 at 11:56 AM, Martin Schmidt <martin.schmi...@gmail.com> wrote:
> Thanks, Stefan.
> Actually I will have to run the searches I am interested in only a few
> times and therefore will drop performance concerns.
>
> Thanks for len(text.split()).
> I will try it later.
>
> The text I am interested in is always in leaf elements.
>
> I have posted a concrete example incl. a representative XML file a few
> minutes ago.
> I hope this clarifies my problem.
>
> Rereading what I wrote, it admittedly sounds funny.
> What I meant was that I did not find a post that closely matches my problem
> (I know that the closeness needed in my case will seem excessive to more
> experienced Python/XML users).
>
> Best regards.
>
> Martin
>
>
> P.S. Sorry for my late reply, but my Internet connection was down for a
> day.
>
>
>> ---------- Forwarded message ----------
>> From: Stefan Behnel <stefan...@behnel.de>
>> To: python-list@python.org
>> Date: Tue, 16 Mar 2010 08:50:30 +0100
>> Subject: Re: extract occurrence of regular expression from elements of XML
>> documents
>> Martin Schmidt, 15.03.2010 18:16:
>>
>>> I have just started to use Python a few weeks ago and until last week I had
>>> no knowledge of XML.
>>> Obviously my programming knowledge is pretty basic.
>>> Now I would like to use Python in combination with ca. 2000 XML documents
>>> (about 30 kb each) to search for certain regular expressions within specific
>>> elements of these documents.
>>
>> 2000 * 30K isn't a huge problem, that's just 60M in total. If you just
>> have to do it once, drop your performance concerns and just get a solution
>> going. If you have to do it once a day, take care to use a tool that is not
>> too resource consuming. If you have strict requirements to do it once a
>> minute, use a fast machine with a couple of cores and do it in parallel. If
>> you have a huge request workload and want to reverse index the XML to do all
>> sorts of sophisticated queries on it, use a database instead.
>>
>>> I would then like to record the number of occurrences of the regular
>>> expression within these elements.
>>> Moreover I would like to count the total number of words contained within
>>> these,
>>
>> len(text.split()) will give you those.
>>
>> BTW, is it document-style XML (with mixed content as in HTML) or is the
>> text always within a leaf element?
>>
>>> and record the attribute of a higher level element that contains
>>> them.
>>
>> An example would certainly help here.
>>
>>> I was trying to figure out the best way to do this, but got overwhelmed
>>> by the available information (e.g. posts using different approaches based on
>>> dom, sax, xpath, elementtree, expat).
>>> The outcome should be a file that lists the extracted attribute, the number
>>> of occurrences of the regular expression, and the total number of words.
>>> I did not find a post that addresses my problem.
>>
>> Funny that you say that after stating that you were overwhelmed by the
>> available information.
>>
>>> If someone could help me with this I would really appreciate it.
>>
>> Most likely, the solution with the best simplicity/performance trade-off
>> would be xml.etree.cElementTree's iterparse(), intercept on each interesting
>> tag name, and search its text/tail using the regexp. That's doable in a
>> couple of lines.
>>
>> But unless you provide more information, it's hard to give better advice.
>>
>> Stefan
>>
>>
>> ---------- Forwarded message ----------
>> From: Chris Rebert <c...@rebertia.com>
>> To: "Lawrence D'Oliveiro" <l...@geek-central.gen.nz>
>> Date: Tue, 16 Mar 2010 00:52:07 -0700
>> Subject: Re: import antigravity
>> On Tue, Mar 16, 2010 at 12:40 AM, Lawrence D'Oliveiro
>> <l...@geek-central.gen.new_zealand> wrote:
>> > Subtle...
>>
>> You're a bit behind the times.
>> If my calculations are right, that comic is over 2 years old.
>>
>> Cheers,
>> Chris
>>
>>
>> ---------- Forwarded message ----------
>> From: Stefan Behnel <stefan...@behnel.de>
>> To: python-list@python.org
>> Date: Tue, 16 Mar 2010 08:51:58 +0100
>> Subject: Re: import antigravity
>> Lawrence D'Oliveiro, 16.03.2010 08:40:
>>
>>> Subtle...
>>
>> Absolutely.
>>
>> Python 2.4.6 (#2, Jan 21 2010, 23:45:25)
>> [GCC 4.4.1] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> import antigravity
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>> ImportError: No module named antigravity
>>
>>
>> Stefan
>>
>>
>> --
>> http://mail.python.org/mailman/listinfo/python-list
>
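For reference, the iterparse() approach Stefan describes in the quoted message, combined with len(text.split()) for the word count, might look roughly like the sketch below. It is only an illustration under assumptions: the element names "section" and "para", the attribute "id", the sample document, and the helper count_matches() are all made up here, since the actual XML files are not part of this message. It also assumes the text sits in leaf elements (as Martin says) and therefore ignores tail text. On current Python versions xml.etree.ElementTree picks up the C accelerator by itself, and xml.etree.cElementTree was removed in Python 3.9, so the plain module is imported.

```python
import io
import re
import xml.etree.ElementTree as ET


def count_matches(source, pattern, leaf_tag, parent_tag, attr):
    """Return one (attribute, regex hits, word count) tuple per leaf element.

    All tag and attribute names are passed in by the caller; the ones used
    below are hypothetical and must be adapted to the real documents.
    """
    regex = re.compile(pattern)
    rows = []
    current_attr = None  # attribute of the enclosing parent_tag element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == parent_tag:
            # Attributes are already available when the start tag is seen.
            current_attr = elem.get(attr)
        elif event == "end" and elem.tag == leaf_tag:
            # The text of a leaf element is complete at its end tag.
            text = elem.text or ""
            hits = len(regex.findall(text))
            words = len(text.split())
            rows.append((current_attr, hits, words))
        elif event == "end" and elem.tag == parent_tag:
            current_attr = None
            elem.clear()  # drop the processed subtree to keep memory low
    return rows


# Made-up sample standing in for one of the ca. 2000 documents.
SAMPLE = b"""<doc>
  <section id="s1">
    <para>Python is fun and python is fast.</para>
    <para>No match here.</para>
  </section>
  <section id="s2">
    <para>XML with Python.</para>
  </section>
</doc>"""

rows = count_matches(io.BytesIO(SAMPLE), r"(?i)python", "para", "section", "id")
for row in rows:
    # attribute, number of regex matches, total number of words
    print("%s\t%d\t%d" % row)
```

For the real files, the io.BytesIO(...) argument can be replaced by a file name, and the tab-separated lines can be written to an output file instead of printed. Nested parent elements are not handled in this sketch; a stack of attribute values would be one way to extend it.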
-- http://mail.python.org/mailman/listinfo/python-list