Hi Danny,

So I got my code working now, and it looks like this:

    TAG = '{http://www.mediawiki.org/xml/export-0.10/}page'

    doc = etree.iterparse(wiki)

    for _, node in doc:
        if node.tag == TAG:
            title = node.find("{http://www.mediawiki.org/xml/export-0.10/}title").text
            if title in page_titles:
                print(etree.tostring(node))
            node.clear()

It's mostly giving me what I want. However, it is adding extra formatting
(I believe namespaces and attributes). I was wondering if there is a way to
strip these out when I'm printing the node with tostring?

Here is an example of the last few lines of my output:

    [[Category:Asteroids| ]]
    [[Category:Spaceflight]]</ns0:text>
          <ns0:sha1>h4rxxfq37qg30eqegyf4vfvkqn3r142</ns0:sha1>
        </ns0:revision>
      </ns0:page>

*Joshua Valdez*
*Computational Linguist : Cognitive Scientist*
(440)-231-0479
jd...@case.edu <j...@uw.edu> | j...@uw.edu | jo...@armsandanchors.com
<http://www.linkedin.com/in/valdezjoshua/>

On Wed, Jul 1, 2015 at 1:17 AM, Danny Yoo <d...@hashcollision.org> wrote:
> Hi Joshua,
>
> The issue you're encountering sounds like XML namespace issues.
>
>> So I tried that code snippet you pointed me to and I'm not getting any
>> output.
>
> This is probably because the tag names of the XML are being prefixed
> with namespaces. This would make the original test for node.tag
> too stingy: it wouldn't exactly match the string we want, because
> there's a namespace prefix in front that's making the string mismatch.
>
> Try relaxing the condition from:
>
>     if node.tag == "page": ...
>
> to something like:
>
>     if node.tag.endswith("page"): ...
>
> This isn't quite technically correct, but we want to confirm whether
> namespaces are the issue that's preventing you from seeing those
> pages.
>
> If namespaces are the issue, then read:
>
>     http://effbot.org/zone/element-namespaces.htm
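
For reference, one possible way to drop the ns0: prefixes is to rewrite
each element's tag before serializing it. This is only a minimal sketch,
assuming that etree is the standard-library xml.etree.ElementTree (the
ns0: prefix in the output suggests so, but that is a guess). The helper
strip_namespace, the tiny sample document, and the results list are
hypothetical names used for illustration and are not part of the code
above.

```python
import io
import xml.etree.ElementTree as etree

# Namespace URI taken from the TAG constant in the original code.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"


def strip_namespace(elem):
    """Remove a leading '{uri}' from the tag of elem and every descendant.

    Hypothetical helper: it modifies the elements in place and returns elem.
    """
    for el in elem.iter():
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return elem


# Made-up stand-in for the real dump file, just to keep the sketch runnable.
sample = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    "<page><title>Asteroid</title><revision><text>body</text></revision></page>"
    "</mediawiki>"
)

results = []
for _, node in etree.iterparse(io.BytesIO(sample.encode("utf-8"))):
    if node.tag == NS + "page":
        # Look up the title while the tags are still namespace-qualified.
        title = node.find(NS + "title").text
        xml_text = etree.tostring(strip_namespace(node), encoding="unicode")
        results.append((title, xml_text))
        node.clear()  # free the finished page's children

for title, xml_text in results:
    print(title, xml_text)
```

Because the tags are rewritten in place, any lookup that relies on the
namespace (such as the title lookup) has to happen before the call to
strip_namespace.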