Hi, I figured out my problem with this code and it's working great, but it's still taking a very long time to process. I was wondering if there is a way to do this with just regular expressions instead of parsing the text with lxml.
The idea would be to identify a <page> tag and then move to the next line of the file to see if there is a match between the title text and the pages in my pages file. I would then want to write the entire page element (<page>fdsalkfdjadslf</page>) to output. So again, my pages are just an array like:

    [Anarchism, Abrahamic Mythology, ...]

I'm a little confused as to how to even start this. My initial idea was something like this, but I'm not sure how to execute it:

    wiki        --> XML file
    page_titles --> array of strings corresponding to titles

    import re

    tag = r'(<page>)'
    wiki = wiki.readlines()

    for line in wiki:
        page = re.search(tag, line)
        if page:
            # ......(I'm not sure what to do here)

Is it possible to look ahead in a loop to discover other lines and then backtrack? I think this may be the solution, but again I'm not sure how I would execute such a command structure.

*Joshua Valdez*
*Computational Linguist : Cognitive Scientist*
(440)-231-0479
jd...@case.edu <j...@uw.edu> | j...@uw.edu | jo...@armsandanchors.com
<http://www.linkedin.com/in/valdezjoshua/>

On Wed, Jul 1, 2015 at 10:13 AM, Joshua Valdez <jd...@case.edu> wrote:

> Hi Danny,
>
> So I got my code working now and it looks like this:
>
> TAG = '{http://www.mediawiki.org/xml/export-0.10/}page'
> doc = etree.iterparse(wiki)
>
> for _, node in doc:
>     if node.tag == TAG:
>         title = node.find("{http://www.mediawiki.org/xml/export-0.10/}title").text
>         if title in page_titles:
>             print(etree.tostring(node))
>         node.clear()
>
> It's mostly giving me what I want. However, it is adding extra formatting
> (I believe namespaces and attributes). I was wondering if there was a way
> to strip these out when I'm printing the node with tostring?
>
> Here is an example of the last few lines of my output:
>
> [[Category:Asteroids| ]]
> [[Category:Spaceflight]]</ns0:text>
> <ns0:sha1>h4rxxfq37qg30eqegyf4vfvkqn3r142</ns0:sha1>
> </ns0:revision>
> </ns0:page>
>
> On Wed, Jul 1, 2015 at 1:17 AM, Danny Yoo <d...@hashcollision.org> wrote:
>
>> Hi Joshua,
>>
>> The issue you're encountering sounds like XML namespace issues.
>>
>> > So I tried that code snippet you pointed me to and I'm not getting
>> > any output.
>>
>> This is probably because the tag names of the XML are being prefixed
>> with namespaces. This would make the original test for node.tag to be
>> too stingy: it wouldn't exactly match the string we want, because
>> there's a namespace prefix in front that's making the string mismatch.
>>
>> Try relaxing the condition from:
>>
>>     if node.tag == "page": ...
>>
>> to something like:
>>
>>     if node.tag.endswith("page"): ...
>>
>> This isn't quite technically correct, but we want to confirm whether
>> namespaces are the issue that's preventing you from seeing those
>> pages.
>>
>> If namespaces are the issue, then read:
>>
>>     http://effbot.org/zone/element-namespaces.htm
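
For what it's worth, the "look ahead and then backtrack" idea can probably be done without any backtracking: start buffering lines at <page>, keep appending until </page>, and only decide at the closing tag whether the buffered block gets written out. Below is a minimal sketch of that, not a tested solution for the real dump. It assumes that <page>, </page>, and <title>...</title> each sit on their own line (which seems to be how the MediaWiki export files are laid out), and the names extract_pages, TITLE_RE, lines, and wanted are made up for illustration.

```python
import re

# Assumption: the whole <title>...</title> element is on a single line.
TITLE_RE = re.compile(r"<title>(.*?)</title>")


def extract_pages(lines, page_titles):
    """Yield the raw text of every <page>...</page> block whose title is wanted."""
    wanted = set(page_titles)  # set lookup avoids scanning a list for every page
    buffer = None              # None means we are currently outside a <page> block
    keep = False               # becomes True once a wanted title is seen in the block

    for line in lines:
        if buffer is None:
            # Outside a page: skip lines until an opening tag shows up.
            if "<page>" in line:
                buffer = [line]
                keep = False
            continue

        # Inside a page: remember every line so the block can be written out whole.
        buffer.append(line)

        match = TITLE_RE.search(line)
        if match and match.group(1) in wanted:
            keep = True

        if "</page>" in line:
            if keep:
                yield "".join(buffer)
            buffer = None  # back to scanning for the next <page>
```

Passing the open file object itself as lines (rather than the result of readlines()) should keep only one page in memory at a time. Two caveats: titles in the dump are XML-escaped (for example &amp; for &), so the strings in page_titles would need to match that form, and this is plain text matching, so it will break if the layout assumption above does not hold. Separately, turning page_titles into a set may also speed up the lxml version, since `title in page_titles` on a list scans the whole list for every page.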